
Conversation

@HardMax71 (Owner) commented Jan 12, 2026


Summary by cubic

Unifies the two event systems into one Avro-backed domain event model, removing the infrastructure/kafka/events layer. Simplifies serialization, routing, DLQ handling, and tests; adds DLQ telemetry events, configurable Kafka timeouts, and small SSE shutdown hooks.

  • Refactors

    • Domain events in domain/events/typed.py now extend AvroBase; EventMetadata replaces AvroEventMetadata.
    • Removed app/infrastructure/kafka/events/*; all imports and type hints now use DomainEvent.
    • SchemaRegistryManager and mappings updated to serialize/deserialize DomainEvent and route by EventType.
    • Producer/consumer/dispatcher and services updated accordingly; DLQ models/docs embed DomainEvent.
    • Added KafkaTopic.DLQ_EVENTS and EventType.DLQ_MESSAGE_RECEIVED/RETRIED/DISCARDED; DLQ manager publishes telemetry.
    • Repositories use typed Beanie operators and monggregate pipelines; DLQ stats and docs are strongly typed.
    • Centralized consumer group→topic subscriptions (CONSUMER_GROUP_SUBSCRIPTIONS); worker/bridge configs read from it.
    • Kafka client timeouts (session/heartbeat/request/max poll) are configurable; tests use shorter values.
  • Migration (see the sketch after this list)

    • Switch imports to app.domain.events.typed.* (was app.infrastructure.kafka.events.*).
    • Rename AvroEventMetadata to EventMetadata.
    • Update type hints to DomainEvent/BaseEvent where applicable.
    • In DLQ, use message.event.event_id and message.event.event_type.
    • Replace ev.topic with get_topic_for_event(ev.event_type).
    • Update DLQStats consumers to handle typed by_topic/by_event_type items and AgeStatistics.
    • UnifiedProducer no longer takes ProducerConfig; pass settings directly to UnifiedProducer.
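A minimal before/after sketch of these migration steps, assuming the module paths named above; the EventMetadata fields and the DLQ helper are illustrative, not the PR's actual code:

    # Before: from app.infrastructure.kafka.events import ExecutionRequestedEvent, AvroEventMetadata
    # After:
    from app.domain.events.typed import DomainEvent, EventMetadata
    from app.infrastructure.kafka.mappings import get_topic_for_event

    # AvroEventMetadata is now EventMetadata; the field names here follow snippets
    # quoted later in this review and may not be exhaustive.
    metadata = EventMetadata(service_name="api", service_version="1.0", correlation_id="local-test")

    def resolve_topic(event: DomainEvent) -> str:
        # ev.topic is gone; the topic is resolved from the event type instead.
        return str(get_topic_for_event(event.event_type))

    def describe_dlq_entry(message) -> str:
        # DLQ records now embed the full domain event under message.event.
        return f"{message.event.event_type}: {message.event.event_id}"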

Written for commit d4a7bec. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • DLQ lifecycle events (received / retried / discarded) plus manual and batch retry/discard endpoints.
  • Improvements

    • Richer DLQ messages with extended metadata, timestamps and history; improved processing, metrics and tracing.
    • Unified domain event model with centralized event→topic routing and Avro-compatible schemas.
    • Configurable Kafka consumer timeouts and a higher MongoDB ulimits setting.
  • Tests

    • Tests migrated to domain events and made deterministic (polling removed; direct awaits/futures).
  • Documentation

    • Architecture and schema docs updated for the unified event model.


@coderabbitai bot commented Jan 12, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Domain events were centralized into an Avro-backed DomainEvent union; DLQ models/repositories converted to Pydantic/Beanie with nested event: DomainEvent; DLQManager refactored to a single async consumer loop that persists DLQ messages and emits DLQ lifecycle events, adding manual and batch retry/discard APIs; many modules and tests updated to use DomainEvent and centralized topic mappings.

Changes

  • DLQ system
    Files: backend/app/dlq/models.py, backend/app/db/docs/dlq.py, backend/app/dlq/manager.py, backend/app/api/routes/dlq.py, backend/app/db/repositories/dlq_repository.py, backend/tests/integration/dlq/*
    Summary: Migrate DLQ models to Pydantic/Beanie with nested event: DomainEvent and from_attributes=True; repo read/write uses model_validate/model_dump; DLQManager unified into a single async consumer loop, emits DLQ lifecycle events and adds retry_message_manually, retry_messages_batch, discard_message_manually; routes and tests updated for nested event fields.
  • Domain events & enums (see the sketch after this list)
    Files: backend/app/domain/events/typed.py, backend/app/domain/enums/events.py, backend/app/domain/enums/kafka.py
    Summary: Introduce Avro-backed typed events and EventMetadata; add DLQ lifecycle event classes and EventType enum members; add DLQ_EVENTS topic and extend the DomainEvent union (discriminator on event_type).
  • Event infra & mappings
    Files: backend/app/infrastructure/kafka/mappings.py, backend/app/infrastructure/kafka/__init__.py, backend/app/infrastructure/kafka/events/*
    Summary: Remove legacy infra-level event modules and re-exports; introduce EVENT_TYPE_TO_TOPIC, get_topic_for_event, get_event_class_for_type; routing moved to mappings and the DomainEvent typed module.
  • Producer / Consumer / Schema
    Files: backend/app/events/core/producer.py, backend/app/events/core/consumer.py, backend/app/events/core/dispatcher.py, backend/app/events/schema/schema_registry.py, backend/app/events/core/types.py
    Summary: Replace BaseEvent with DomainEvent across producer/consumer/dispatcher/schema registry; producer derives topics from mappings and serializes DomainEvent; SchemaRegistryManager and dispatcher adapted for the DomainEvent union; consumer/producer config signatures updated.
  • Repositories & queries
    Files: backend/app/db/repositories/*.py, backend/app/db/docs/dlq.py
    Summary: Switch dict/Eq filters to Beanie field expressions; adopt monggregate Pipeline/S for aggregations; remove legacy query builders; repo methods return Pydantic models via model_validate.
  • Services & workers
    Files: backend/app/services/**, backend/workers/dlq_processor.py, backend/app/events/*
    Summary: Widespread import swap to DomainEvent/EventMetadata; rename publish API to publish_domain_event; use centralized CONSUMER_GROUP_SUBSCRIPTIONS; extend consumer configs with session/heartbeat/max_poll/request timeouts; DLQ worker/manager use nested event fields and model-based persistence.
  • Tests & helpers
    Files: backend/tests/**, backend/tests/helpers/*
    Summary: Tests updated to DomainEvent/EventMetadata and nested event fields; removed the eventually polling helper in many tests in favor of futures/asyncio.wait_for and pre-produce patterns; imports adjusted to new module locations.
  • Docs, packaging & config
    Files: docs/architecture/*, backend/pyproject.toml, backend/.env.test, docker-compose.yaml, backend/app/settings.py
    Summary: Documentation updated to describe the Avro-backed DomainEvent and centralized mappings; monggregate added to dependencies; test env and Kafka timeout settings added; docker-compose ulimits added; Settings include new Kafka timeout fields.
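To make the shape concrete, a condensed sketch of the discriminated union and the centralized topic routing described in the "Domain events & enums" and "Event infra & mappings" entries above. The DLQ event class names, the metadata fields, and the use of pydantic BaseModel in place of AvroBase are assumptions; the enum members and helper names appear in this PR:

    from typing import Annotated, Literal, Union

    from pydantic import BaseModel, Discriminator  # BaseModel stands in for AvroBase here

    from app.domain.enums.events import EventType
    from app.domain.enums.kafka import KafkaTopic


    class EventMetadata(BaseModel):  # replaces AvroEventMetadata
        service_name: str
        service_version: str
        correlation_id: str


    class BaseEvent(BaseModel):
        event_id: str
        metadata: EventMetadata


    class DLQMessageReceivedEvent(BaseEvent):  # class name assumed
        event_type: Literal[EventType.DLQ_MESSAGE_RECEIVED] = EventType.DLQ_MESSAGE_RECEIVED


    class DLQMessageRetriedEvent(BaseEvent):  # class name assumed
        event_type: Literal[EventType.DLQ_MESSAGE_RETRIED] = EventType.DLQ_MESSAGE_RETRIED


    # The real union spans every concrete event class; event_type is the discriminator.
    DomainEvent = Annotated[
        Union[DLQMessageReceivedEvent, DLQMessageRetriedEvent],
        Discriminator("event_type"),
    ]

    # Centralized routing (app/infrastructure/kafka/mappings.py) replaces per-event topics.
    EVENT_TYPE_TO_TOPIC: dict[EventType, KafkaTopic] = {
        EventType.DLQ_MESSAGE_RECEIVED: KafkaTopic.DLQ_EVENTS,
        EventType.DLQ_MESSAGE_RETRIED: KafkaTopic.DLQ_EVENTS,
    }


    def get_topic_for_event(event_type: EventType) -> KafkaTopic:
        # Falls back to SYSTEM_EVENTS, mirroring the reviewer note on safe lookups.
        return EVENT_TYPE_TO_TOPIC.get(event_type, KafkaTopic.SYSTEM_EVENTS)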

Sequence Diagram(s)

sequenceDiagram
    participant Prod as UnifiedProducer
    participant Schema as SchemaRegistryManager
    participant Kafka as Kafka Broker
    participant DLQMgr as DLQManager
    participant Mongo as MongoDB
    participant DLQEvents as DLQ Events Consumer

    Prod->>Schema: serialize_event(DomainEvent)
    Schema-->>Prod: bytes
    Prod->>Kafka: publish to EVENT_TYPE_TO_TOPIC[event.event_type]
    alt publish fails or send_to_dlq invoked
        Prod->>Kafka: publish DLQ payload to DLQ topic
    end
    Kafka->>DLQMgr: deliver DLQ topic message
    DLQMgr->>Mongo: persist DLQMessage (message.model_dump())
    DLQMgr->>Kafka: emit DLQ_MESSAGE_RECEIVED on DLQ_EVENTS
    Kafka->>DLQEvents: deliver DLQ_MESSAGE_RECEIVED
    DLQEvents->>DLQMgr: request manual/batch retry or discard
    alt retry requested
        DLQMgr->>Kafka: republish original DomainEvent
        DLQMgr->>Kafka: emit DLQ_MESSAGE_RETRIED
    else discard requested
        DLQMgr->>Mongo: update status = DISCARDED
        DLQMgr->>Kafka: emit DLQ_MESSAGE_DISCARDED
    end
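A rough sketch of the single consumer loop the diagram describes. The consumer/repository/producer interfaces and the lifecycle-event helper are invented; only the retry/discard API names and lifecycle event names come from the PR text:

    import asyncio


    class DLQManagerSketch:
        def __init__(self, consumer, repository, producer, logger) -> None:
            self._consumer = consumer      # subscribed to the DLQ topic
            self._repository = repository  # persists DLQMessage documents in MongoDB
            self._producer = producer      # publishes DomainEvents
            self._logger = logger
            self._running = True

        async def run(self) -> None:
            while self._running:
                try:
                    raw = await self._consumer.get_next()        # placeholder consume call
                    dlq_message = self._repository.parse(raw)    # DLQMessage with nested event: DomainEvent
                    await self._repository.insert(dlq_message.model_dump())
                    await self._producer.produce(self._lifecycle_event(dlq_message, "received"))
                except Exception:
                    self._logger.exception("DLQ consumer loop error")
                    await asyncio.sleep(5)  # backoff the reviewers ask for on persistent errors

        async def retry_message_manually(self, dlq_message) -> None:
            await self._producer.produce(dlq_message.event)      # republish the original DomainEvent
            await self._producer.produce(self._lifecycle_event(dlq_message, "retried"))

        async def discard_message_manually(self, dlq_message) -> None:
            await self._repository.mark_discarded(dlq_message)   # status = DISCARDED
            await self._producer.produce(self._lifecycle_event(dlq_message, "discarded"))

        def _lifecycle_event(self, dlq_message, kind: str):
            # Placeholder: build the matching DLQ_MESSAGE_* DomainEvent for `kind`.
            raise NotImplementedError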

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

🐇 I hopped through types and nested event trees,
Avro-wrapped DomainEvents dancing on breeze.
DLQ messages land, saved, and then told,
Retries and discards now ring as events bold.
No more polling — futures leap, quick and pleased.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 31.72%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title '2 events system (domain, avro) -> single one' accurately and concisely summarizes the main change: consolidating two event systems into a single unified model.



🧹 Recent nitpick comments
backend/app/events/core/producer.py (1)

63-71: Consider making producer settings configurable.

The producer configuration values (acks, compression, batch size, linger, idempotence) are now hard-coded. While these are reasonable defaults, consider exposing at least max_batch_size and linger_ms through Settings for tuning in different environments.
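One way to surface the tunables this comment mentions; the Settings field names and defaults below are suggestions, not the PR's actual configuration surface:

    from pydantic_settings import BaseSettings


    class ProducerTuningSettings(BaseSettings):
        KAFKA_PRODUCER_MAX_BATCH_SIZE: int = 16384  # bytes; illustrative default
        KAFKA_PRODUCER_LINGER_MS: int = 5           # illustrative default


    # Inside UnifiedProducer.__init__ the hard-coded values would then read from
    # settings, e.g. settings.KAFKA_PRODUCER_MAX_BATCH_SIZE instead of a literal.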

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61cf497 and d4a7bec.

📒 Files selected for processing (11)
  • backend/.env
  • backend/app/core/providers.py
  • backend/app/events/core/__init__.py
  • backend/app/events/core/producer.py
  • backend/app/events/core/types.py
  • backend/app/settings.py
  • backend/tests/integration/events/test_producer_roundtrip.py
  • backend/tests/unit/events/core/test_consumer_config.py
  • backend/tests/unit/events/test_mappings_and_types.py
  • docs/reference/environment-variables.md
  • docs/testing/kafka-test-stability.md
💤 Files with no reviewable changes (3)
  • backend/tests/unit/events/test_mappings_and_types.py
  • backend/app/events/core/__init__.py
  • backend/tests/unit/events/core/test_consumer_config.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/tests/integration/events/test_producer_roundtrip.py
  • docs/testing/kafka-test-stability.md
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/core/providers.py (1)
backend/app/events/core/producer.py (2)
  • UnifiedProducer (23-215)
  • producer (55-56)
🪛 dotenv-linter (4.0.0)
backend/.env

[warning] 27-27: [UnorderedKey] The KAFKA_SESSION_TIMEOUT_MS key should go before the SCHEMA_REGISTRY_URL key

[warning] 28-28: [UnorderedKey] The KAFKA_HEARTBEAT_INTERVAL_MS key should go before the KAFKA_SESSION_TIMEOUT_MS key

[warning] 29-29: [UnorderedKey] The KAFKA_REQUEST_TIMEOUT_MS key should go before the KAFKA_SESSION_TIMEOUT_MS key

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Scan Backend
  • GitHub Check: Scan Frontend
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (9)
backend/app/events/core/types.py (1)

42-45: LGTM! Configuration values are well-tuned.

The updated Kafka consumer configuration follows best practices:

  • heartbeat_interval_ms (10000) is appropriately less than session_timeout_ms / 3 (15000)
  • request_timeout_ms (40000) is slightly below session_timeout_ms (45000) to avoid premature session expiry

The past review concern about request_timeout_ms not being used in to_consumer_config() is no longer applicable since that method has been removed from this PR.

backend/app/events/core/producer.py (2)

108-145: Approve the unified produce() method with DomainEvent.

The method correctly:

  • Serializes via schema registry
  • Converts headers to the expected tuple format
  • Updates metrics on success/failure
  • Properly raises exceptions for upstream handling

147-214: LGTM! DLQ handling is well-implemented.

The send_to_dlq method properly:

  • Captures producer context (hostname + task name) for traceability
  • Serializes the DomainEvent using model_dump(mode="json") for flexibility
  • Includes comprehensive headers for debugging
  • Has appropriate fallback logging when DLQ send fails
backend/app/core/providers.py (2)

46-46: LGTM! Import simplified after ProducerConfig removal.


160-164: LGTM! Producer instantiation correctly updated.

The UnifiedProducer constructor call now matches the simplified signature (schema_registry, logger, settings) after the ProducerConfig removal.

backend/.env (1)

27-29: LGTM! Kafka timeout configuration is consistent and correctly tuned.

The new environment variables match the defaults in ConsumerConfig:

  • KAFKA_HEARTBEAT_INTERVAL_MS (10000) satisfies the Kafka requirement of being less than session_timeout / 3
  • KAFKA_REQUEST_TIMEOUT_MS (40000) is appropriately below the session timeout

The static analysis warnings about key ordering are cosmetic—keeping Kafka variables grouped together is more maintainable than strict alphabetical ordering.

docs/reference/environment-variables.md (1)

46-62: LGTM! Kafka documentation is comprehensive and accurate.

The updated environment variable reference correctly documents all new Kafka timing parameters with:

  • Accurate default values matching the code
  • Clear descriptions with explicit unit notation (ms)
  • Consistent table formatting
backend/app/settings.py (2)

63-67: Kafka timeout settings look well-configured.

The new timeout values follow Kafka best practices:

  • session.timeout.ms (45000) ≥ 3 × heartbeat.interval.ms (10000) ✓
  • request.timeout.ms (40000) < session.timeout.ms (45000) ✓
  • max.poll.interval.ms (300000 = 5 min) provides adequate headroom for processing batches

These changes should improve consumer stability and reduce spurious rebalances.
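A sketch of the corresponding Settings fields; the default values are the ones quoted in this review, the first three env var names appear in backend/.env, and the max-poll name is an assumption:

    from pydantic_settings import BaseSettings


    class KafkaTimeoutSettings(BaseSettings):
        KAFKA_SESSION_TIMEOUT_MS: int = 45000      # session.timeout.ms
        KAFKA_HEARTBEAT_INTERVAL_MS: int = 10000   # kept under session_timeout / 3
        KAFKA_REQUEST_TIMEOUT_MS: int = 40000      # kept below the session timeout
        KAFKA_MAX_POLL_INTERVAL_MS: int = 300000   # 5 minutes of processing headroom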


166-166: The review comment is based on an incorrect assumption about the code history. The env_file parameter was hardcoded to ".env" when this settings file was created, not changed from a dynamic path resolution. Additionally, DOTENV_PATH is not referenced anywhere in the codebase and does not appear to be part of the deployment strategy.

If dynamic environment file path selection is a genuine requirement for future deployments, that would be a valid enhancement request. However, the current implementation does not represent a regression from a previously flexible system, as the comment implies.

Likely an incorrect or invalid review comment.



@codecov-commenter commented Jan 12, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 86.97674% with 56 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| backend/app/dlq/manager.py | 60.00% | 20 Missing ⚠️ |
| backend/app/db/repositories/saga_repository.py | 36.36% | 7 Missing ⚠️ |
| ...p/db/repositories/admin/admin_events_repository.py | 70.00% | 6 Missing ⚠️ |
| backend/app/db/repositories/user_repository.py | 33.33% | 4 Missing ⚠️ |
| backend/app/db/repositories/replay_repository.py | 25.00% | 3 Missing ⚠️ |
| backend/app/db/repositories/sse_repository.py | 0.00% | 2 Missing ⚠️ |
| backend/app/services/notification_service.py | 60.00% | 2 Missing ⚠️ |
| backend/app/api/routes/dlq.py | 75.00% | 1 Missing ⚠️ |
| ...app/db/repositories/admin/admin_user_repository.py | 91.66% | 1 Missing ⚠️ |
| backend/app/db/repositories/event_repository.py | 66.66% | 1 Missing ⚠️ |

... and 9 more
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
| Flag | Coverage Δ |
|---|---|
| backend-e2e | 57.46% <54.18%> (-1.36%) ⬇️ |
| backend-integration | 73.89% <83.95%> (-0.91%) ⬇️ |
| backend-unit | 60.02% <53.95%> (-1.28%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| backend/app/api/routes/events.py | 71.55% <100.00%> (-0.26%) ⬇️ |
| backend/app/api/routes/execution.py | 77.22% <100.00%> (-0.23%) ⬇️ |
| backend/app/core/providers.py | 95.10% <100.00%> (-0.02%) ⬇️ |
| backend/app/db/docs/dlq.py | 100.00% <100.00%> (+5.55%) ⬆️ |
| ...db/repositories/admin/admin_settings_repository.py | 100.00% <100.00%> (ø) |
| backend/app/db/repositories/dlq_repository.py | 100.00% <100.00%> (+1.26%) ⬆️ |
| ...ackend/app/db/repositories/execution_repository.py | 92.59% <100.00%> (ø) |
| ...end/app/db/repositories/notification_repository.py | 80.70% <100.00%> (ø) |
| ...end/app/db/repositories/saved_script_repository.py | 100.00% <100.00%> (ø) |
| backend/app/dlq/models.py | 98.38% <100.00%> (-0.06%) ⬇️ |

... and 49 more

... and 1 file with indirect coverage changes


@cubic-dev-ai bot left a comment

1 issue found across 7 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/tests/unit/services/saga/test_execution_saga_steps.py">

<violation number="1" location="backend/tests/unit/services/saga/test_execution_saga_steps.py:6">
P3: Duplicate import source: `app.domain.events.typed` is imported twice. Consider consolidating with the existing import on line 3: `from app.domain.events.typed import BaseEvent, ExecutionRequestedEvent`</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@cubic-dev-ai bot left a comment

15 issues found across 92 files

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed.

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="docs/architecture/model-conversion.md">

<violation number="1" location="docs/architecture/model-conversion.md:28">
P3: Path inconsistency in architecture diagram: `app/domain/events/typed.py` is placed in the Infrastructure Layer box, but `app/domain/` is shown as the Domain Layer. Consider either moving the events to an infrastructure path (e.g., `app/infrastructure/events/typed.py`) to match the layer designation, or clarify in the documentation why Pydantic event models live within the domain directory structure.</violation>
</file>

<file name="backend/tests/integration/dlq/test_dlq_manager.py">

<violation number="1" location="backend/tests/integration/dlq/test_dlq_manager.py:60">
P2: Broad exception catching with debug-level logging may mask test failures. If deserialization consistently fails, the test will timeout after 15 seconds with no actionable error message. Consider logging at warning/error level, or re-raising after a few failures to make issues visible.</violation>
</file>

<file name="backend/tests/integration/idempotency/test_consumer_idempotent.py">

<violation number="1" location="backend/tests/integration/idempotency/test_consumer_idempotent.py:84">
P1: Test no longer verifies idempotency correctly. The Future resolves after the first event, but the second duplicate event may not have been processed yet. The assertion `seen["n"] >= 1` doesn't prove duplicates were blocked—it would pass even if idempotency is broken. Consider waiting for both events to be attempted (e.g., brief delay after Future resolves) and asserting `seen["n"] == 1` to verify exactly one event was processed.</violation>
</file>

<file name="backend/app/dlq/models.py">

<violation number="1" location="backend/app/dlq/models.py:52">
P0: Breaking change: `age_seconds` property was removed but is still used by `dlq_processor.py` for filtering old messages and by `DLQMessageResponse` schema. This will cause `AttributeError` at runtime.</violation>
</file>

<file name="backend/app/events/core/producer.py">

<violation number="1" location="backend/app/events/core/producer.py:121">
P2: Using direct dictionary access `EVENT_TYPE_TO_TOPIC[...]` may raise `KeyError` if an event type is not in the mapping. Consider using the safer `get_topic_for_event()` function from the same module, which provides a default fallback to `KafkaTopic.SYSTEM_EVENTS`.</violation>
</file>

<file name="backend/app/services/notification_service.py">

<violation number="1" location="backend/app/services/notification_service.py:330">
P2: Changing from `asyncio.create_task()` to `await` blocks the caller until notification delivery completes. Since `_deliver_notification` can make external HTTP calls (webhooks/Slack with 30s timeouts), this could significantly increase latency for callers of `create_notification`. If this blocking behavior is intentional, consider whether the performance impact on API response times is acceptable. Otherwise, restore the background task pattern.</violation>
</file>

<file name="docs/components/schema-manager.md">

<violation number="1" location="docs/components/schema-manager.md:21">
P3: Inconsistent path format: should be `app/domain/events/typed.py` to match the convention used elsewhere in this document (e.g., `app/db/schema/schema_manager.py`).</violation>
</file>

<file name="backend/app/dlq/manager.py">

<violation number="1" location="backend/app/dlq/manager.py:75">
P1: Missing null safety in header decoding. If any Kafka header value is `None`, this will crash with `AttributeError`. The old code handled this with `v.decode("utf-8") if isinstance(v, (bytes, bytearray)) else (v or "")`.</violation>

<violation number="2" location="backend/app/dlq/manager.py:137">
P2: Missing error backoff sleep after exception. If persistent errors occur (e.g., database unavailable), the loop will spin rapidly without pause, causing excessive logging and CPU usage. The old code had `await asyncio.sleep(5)` after errors.</violation>
</file>

<file name="backend/app/services/sse/sse_shutdown_manager.py">

<violation number="1" location="backend/app/services/sse/sse_shutdown_manager.py:68">
P1: The `initiated_event` is created but never set. External code awaiting this event will hang indefinitely. Add `self.initiated_event.set()` in `initiate_shutdown()` after setting `_shutdown_initiated = True`.</violation>

<violation number="2" location="backend/app/services/sse/sse_shutdown_manager.py:69">
P1: The `notifying_event` is created but never set. External code awaiting this event will hang indefinitely. Add `self.notifying_event.set()` in `_notify_connections()` after setting the phase to NOTIFYING.</violation>
</file>

<file name="backend/app/events/core/dispatcher.py">

<violation number="1" location="backend/app/events/core/dispatcher.py:29">
P2: Dead code: `_topic_event_types` is initialized but never populated or used. The `_build_topic_mapping` method that populated this attribute was removed, but the attribute declaration was left behind. Consider removing this unused attribute.</violation>
</file>

<file name="backend/app/services/kafka_event_service.py">

<violation number="1" location="backend/app/services/kafka_event_service.py:201">
P2: Logging was reduced to only `event_id`, removing `event_type` and `aggregate_id` that were previously logged. This makes debugging and tracing events harder. Consider adding these fields back for consistency with `publish_event` and better observability.</violation>
</file>

<file name="backend/app/services/coordinator/coordinator.py">

<violation number="1" location="backend/app/services/coordinator/coordinator.py:246">
P1: Off-by-one error: queue positions are 1-indexed, so the front of queue is `position == 1`, not `position == 0`. The current check will never trigger for items at the front of the queue - it only triggers when position was `None` (error case) and defaulted to 0.</violation>
</file>

<file name="backend/tests/unit/services/idempotency/test_middleware.py">

<violation number="1" location="backend/tests/unit/services/idempotency/test_middleware.py:31">
P2: `MagicMock(spec=DomainEvent)` is problematic because `DomainEvent` is an `Annotated` union type alias, not a class. The `spec` parameter is designed for concrete classes. Consider using `spec=BaseEvent` instead, which is the base class that defines `event_type` and `event_id`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
backend/app/services/notification_service.py (1)

631-633: Fix null-pointer dereference on resource_usage.

The pipeline failure indicates resource_usage can be None (typed as ResourceUsageDomain | None), but line 632 accesses .execution_time_wall_seconds without a null check. This will raise AttributeError at runtime.

🐛 Proposed fix
     title = f"Execution Completed: {event.execution_id}"
-    body = (
-        f"Your execution completed successfully. Duration: {event.resource_usage.execution_time_wall_seconds:.2f}s."
-    )
+    if event.resource_usage:
+        body = f"Your execution completed successfully. Duration: {event.resource_usage.execution_time_wall_seconds:.2f}s."
+    else:
+        body = "Your execution completed successfully."
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)

30-43: Pipeline failure: MyPy reports BaseEvent has type "Any".

The metadata field update to EventMetadata is correct. However, MyPy reports that BaseEvent (imported from app.infrastructure.kafka.events on line 17) has type Any, causing the "Class cannot subclass" error.

This suggests BaseEvent may lack proper type annotations or exports. Consider:

  1. Ensuring BaseEvent is properly typed in app.infrastructure.kafka.events
  2. Or updating _FakeEvent to extend from DomainEvent (from app.domain.events.typed) to align with the migration
#!/bin/bash
# Check BaseEvent definition and exports in infrastructure module
ast-grep --pattern 'class BaseEvent($_) {
  $$$
}'

# Also check if there's a __all__ that might be missing BaseEvent
rg -n "BaseEvent" --type py -C 3 backend/app/infrastructure/kafka/events/
backend/tests/unit/services/saga/test_saga_step_and_base.py (1)

75-76: Type error: SagaStep[BaseEvent] generic constraint violation.

SagaStep is generic over DomainEvent, which is a union of concrete event types. Using BaseEvent as the type argument violates this constraint. Replace with a concrete event type or use DomainEvent directly if the step should handle any domain event.

🐛 Proposed fix
-        def get_steps(self) -> list[SagaStep[BaseEvent]]:
+        def get_steps(self) -> list[SagaStep[DomainEvent]]:
             return []
-    class S(SagaStep[BaseEvent]):
-        async def execute(self, context: SagaContext, event: BaseEvent) -> bool:
+    class S(SagaStep[DomainEvent]):
+        async def execute(self, context: SagaContext, event: DomainEvent) -> bool:
             return True

Also add the import at line 6:

-from app.domain.events.typed import BaseEvent, EventMetadata
+from app.domain.events.typed import BaseEvent, DomainEvent, EventMetadata

Also applies to: 82-84

backend/tests/integration/events/test_schema_registry_real.py (1)

17-24: PodCreatedEvent has no topic attribute.

The pipeline fails because PodCreatedEvent doesn't expose a topic attribute. Based on the PR's migration to domain events, topic resolution is now handled via centralized mappings (likely in mappings.py), not as an event attribute.

🐛 Proposed fix using topic mapping or constant
+from app.domain.enums.kafka import KafkaTopic

Then update the deserialization call:

-    obj = await m.deserialize_event(data, topic=str(ev.topic))
+    obj = await m.deserialize_event(data, topic=KafkaTopic.POD_EVENTS.value)

Alternatively, if there's a topic resolver utility:

+from app.infrastructure.kafka.mappings import get_topic_for_event
-    obj = await m.deserialize_event(data, topic=str(ev.topic))
+    obj = await m.deserialize_event(data, topic=get_topic_for_event(ev))
backend/workers/dlq_processor.py (1)

67-71: DLQMessage does not have an age_seconds attribute—use created_at to compute message age.

The filter function accesses message.age_seconds, which does not exist on the DLQMessage class. The class has a created_at datetime field that should be used instead.

Fix
+from datetime import datetime, timezone

 def filter_old_messages(message: DLQMessage) -> bool:
     max_age_days = 7
-    return message.age_seconds < (max_age_days * 24 * 3600)
+    max_age_seconds = max_age_days * 24 * 3600
+    age = (datetime.now(timezone.utc) - message.created_at).total_seconds()
+    return age < max_age_seconds
backend/tests/integration/services/sse/test_redis_bus.py (1)

21-28: Change BaseEvent import to correct module path.

Line 12 imports BaseEvent from a non-existent module app.infrastructure.kafka.events. BaseEvent is defined in app.domain.events.typed and exported from app.domain.events. Change the import to:

from app.domain.events import BaseEvent

This non-existent module causes MyPy to treat BaseEvent as type Any, which triggers the subclass error for _DummyEvent.

backend/tests/unit/services/sse/test_kafka_redis_bridge.py (1)

26-29: Type mismatch: _FakeBus.publish_event signature differs from SSERedisBus.

The real SSERedisBus.publish_event now accepts DomainEvent (line 65 in redis_bus.py), but this fake uses BaseEvent. This inconsistency could mask integration issues and cause type errors.

🔧 Proposed fix
+from app.domain.events.typed import DomainEvent
+
 class _FakeBus(SSERedisBus):
     """Fake SSERedisBus for testing."""

     def __init__(self) -> None:
-        self.published: list[tuple[str, BaseEvent]] = []
+        self.published: list[tuple[str, DomainEvent]] = []

-    async def publish_event(self, execution_id: str, event: BaseEvent) -> None:
+    async def publish_event(self, execution_id: str, event: DomainEvent) -> None:
         self.published.append((execution_id, event))
backend/app/services/result_processor/processor.py (2)

177-183: Fix null dereference on resource_usage - pipeline failure.

ExecutionCompletedEvent.resource_usage is typed as ResourceUsageDomain | None. Accessing .execution_time_wall_seconds and .peak_memory_kb without a null check will raise AttributeError when resource_usage is None.

Proposed fix
-        runtime_seconds = event.resource_usage.execution_time_wall_seconds
-        self._metrics.record_execution_duration(runtime_seconds, lang_and_version)
-
-        # Record memory utilization
-        memory_mib = event.resource_usage.peak_memory_kb / 1024
+        if event.resource_usage is not None:
+            runtime_seconds = event.resource_usage.execution_time_wall_seconds
+            self._metrics.record_execution_duration(runtime_seconds, lang_and_version)
+
+            # Record memory utilization
+            memory_mib = event.resource_usage.peak_memory_kb / 1024
+        else:
+            self.logger.warning(f"No resource_usage for completed execution {event.execution_id}")
+            runtime_seconds = 0.0
+            memory_mib = 0.0

Also guard the memory utilization percentage calculation (lines 186-191) since it depends on memory_mib.


218-218: Fix type mismatch in record_error call - pipeline failure.

event.error_type is ExecutionErrorType | None, but record_error expects a str. Additionally, if error_type is None, this will fail.

Proposed fix
-        self._metrics.record_error(event.error_type)
+        if event.error_type is not None:
+            self._metrics.record_error(event.error_type.value if hasattr(event.error_type, 'value') else str(event.error_type))

Or if record_error should accept the enum directly, update its signature instead.

backend/tests/integration/dlq/test_dlq_manager.py (1)

63-70: Fix: DomainEvent uses model_dump(), not to_dict().

The pipeline failure indicates ExecutionRequestedEvent has no attribute to_dict. Pydantic v2 models use model_dump() for serialization.

🐛 Proposed fix
     payload = {
-        "event": ev.to_dict(),
+        "event": ev.model_dump(mode="json"),
         "original_topic": f"{prefix}{str(KafkaTopic.EXECUTION_EVENTS)}",
         "error": "handler failed",
         "retry_count": 0,
         "failed_at": datetime.now(timezone.utc).isoformat(),
         "producer_id": "tests",
     }
backend/app/events/core/producer.py (1)

119-148: Potential UnboundLocalError if event_type not in topic mapping.

If EVENT_TYPE_TO_TOPIC[event_to_produce.event_type] raises a KeyError at line 121, the topic variable will be undefined when the except KafkaError block executes at line 146, causing an UnboundLocalError.

🐛 Proposed fix: define topic before the try block or handle KeyError separately
     async def produce(
         self, event_to_produce: DomainEvent, key: str | None = None, headers: dict[str, str] | None = None
     ) -> None:
         """Produce a message to Kafka."""
         if not self._producer:
             self.logger.error("Producer not running")
             return

+        topic = f"{self._topic_prefix}{EVENT_TYPE_TO_TOPIC[event_to_produce.event_type]}"
         try:
             serialized_value = await self._schema_registry.serialize_event(event_to_produce)
-            topic = f"{self._topic_prefix}{EVENT_TYPE_TO_TOPIC[event_to_produce.event_type]}"

             # Convert headers to list of tuples format
             header_list = [(k, v.encode()) for k, v in headers.items()] if headers else None
🤖 Fix all issues with AI agents
In @backend/app/services/sse/sse_shutdown_manager.py:
- Around line 67-70: The new asyncio.Events initiated_event and notifying_event
are never set; in initiate_shutdown() call self.initiated_event.set()
immediately after transitioning state to the initiated/shutdown-started phase
(so external coordinators see shutdown begin), and in _notify_connections() call
self.notifying_event.set() right before or as you begin the notifying/broadcast
phase so external observers know notifying has started; ensure these calls are
safe to call idempotently (no blocking) and placed before any awaits that yield
control.
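A minimal sketch of that fix; the attribute and method names follow the review comment, and everything else in the class is elided:

    import asyncio


    class SSEShutdownManagerSketch:
        def __init__(self) -> None:
            self._shutdown_initiated = False
            self.initiated_event = asyncio.Event()
            self.notifying_event = asyncio.Event()

        async def initiate_shutdown(self) -> None:
            self._shutdown_initiated = True
            self.initiated_event.set()       # idempotent; signal before any awaits
            await self._notify_connections()

        async def _notify_connections(self) -> None:
            self.notifying_event.set()       # signal that the NOTIFYING phase has begun
            # ... broadcast shutdown notices to registered connections ...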

In @backend/tests/integration/services/sse/test_partitioned_event_router.py:
- Around line 49-51: The test is flaky because using
asyncio.wait_for(subscription.get(RedisSSEMessage), timeout=2.0) only protects
against hangs but subscription.get has an internal 0.5s timeout and can return
None; replace the single-call approach with a retry/poll loop (or reuse the
existing eventual() helper) that repeatedly calls
subscription.get(RedisSSEMessage) until a non-None message is returned or a
total timeout elapses, and then assert msg is not None; target the test code
around subscription.get, RedisSSEMessage, and asyncio.wait_for to implement the
retry semantics.

In @backend/tests/unit/services/saga/test_execution_saga_steps.py:
- Around line 3-6: The _FakeProducer test double currently uses BaseEvent types
which mismatches UnifiedProducer.produce's contract; update the imports to
include DomainEvent from app.domain.events.typed and change _FakeProducer so its
events list is typed as list[DomainEvent] and its async produce method
signature/parameter uses DomainEvent (not BaseEvent), matching
UnifiedProducer.produce; ensure you remove or stop using BaseEvent in
_FakeProducer and keep behavior of appending the event to self.events.

In @backend/tests/unit/services/saga/test_saga_step_and_base.py:
- Around line 46-50: The test constructs BaseEvent (an abstract base) and passes
it to SagaContext.add_event, but add_event expects a concrete DomainEvent
subtype; replace the BaseEvent instantiation with a concrete event class (e.g.,
a SystemErrorEvent or other concrete subclass that corresponds to
EventType.SYSTEM_ERROR) and pass that to ctx.add_event, or if such a concrete
class does not exist, add a concrete subclass of BaseEvent (and include it in
the DomainEvent union) with appropriate metadata and use that instance in the
test.
🧹 Nitpick comments (9)
backend/tests/integration/services/events/test_event_bus.py (1)

51-53: Consider extending assertions for robustness.

The Future-based await with timeout is cleaner than polling. Optionally, consider also asserting on the payload to ensure full event integrity:

assert received.event_type == "test.created"
assert received.payload == {"x": 1}

This would catch potential serialization/deserialization issues in the Kafka path.

backend/tests/integration/test_replay_routes.py (1)

388-392: Validate start response and align with file's error-handling pattern.

Two concerns:

  1. Line 388: The start request response is discarded. If the session fails to start, the test proceeds anyway and may produce misleading failures on line 392.

  2. Line 392: Uses strict == 200 assertion, but other tests in this file (e.g., lines 102, 135, 168) use in [200, 404] with early return. This inconsistency could cause flaky test failures if the session becomes unavailable.

Suggested fix to align with file patterns
-        await test_admin.post(f"/api/v1/replay/sessions/{session_id}/start")
-
-        # Check progress immediately - session state available right after start
-        detail_response = await test_admin.get(f"/api/v1/replay/sessions/{session_id}")
-        assert detail_response.status_code == 200
+        start_response = await test_admin.post(f"/api/v1/replay/sessions/{session_id}/start")
+        assert start_response.status_code in [200, 404]
+        if start_response.status_code != 200:
+            return
+
+        # Check progress immediately - session state available right after start
+        detail_response = await test_admin.get(f"/api/v1/replay/sessions/{session_id}")
+        assert detail_response.status_code in [200, 404]
+        if detail_response.status_code != 200:
+            return
backend/app/events/schema/schema_registry.py (1)

16-21: Consider adding defensive handling for union extraction.

The logic assumes DomainEvent is always Annotated[Union[...], Discriminator]. If the type structure changes, this could break silently. The current implementation is correct for the existing design, but consider adding validation.

💡 Optional: Add defensive check
 @lru_cache(maxsize=1)
 def _get_all_event_classes() -> list[type[DomainEvent]]:
     """Get all concrete event classes from DomainEvent union."""
     union_type = get_args(DomainEvent)[0]  # Annotated[Union[...], Discriminator] -> Union
+    if not get_args(union_type):
+        raise RuntimeError("DomainEvent union has no concrete types - check type definition")
     return list(get_args(union_type)) if get_origin(union_type) else [union_type]
backend/app/services/k8s_worker/worker.py (1)

215-225: Consider replacing assert with explicit type guards for production robustness.

Using assert isinstance(...) for type narrowing works but can be silently disabled when Python runs with the -O (optimize) flag. In production, this could lead to unexpected AttributeError exceptions instead of clear type errors.

♻️ Suggested defensive pattern
 async def _handle_create_pod_command_wrapper(self, event: DomainEvent) -> None:
     """Wrapper for handling CreatePodCommandEvent with type safety."""
-    assert isinstance(event, CreatePodCommandEvent)
+    if not isinstance(event, CreatePodCommandEvent):
+        self.logger.error(f"Expected CreatePodCommandEvent, got {type(event).__name__}")
+        return
     self.logger.info(f"Processing create_pod_command for execution {event.execution_id} from saga {event.saga_id}")
     await self._handle_create_pod_command(event)

 async def _handle_delete_pod_command_wrapper(self, event: DomainEvent) -> None:
     """Wrapper for handling DeletePodCommandEvent."""
-    assert isinstance(event, DeletePodCommandEvent)
+    if not isinstance(event, DeletePodCommandEvent):
+        self.logger.error(f"Expected DeletePodCommandEvent, got {type(event).__name__}")
+        return
     self.logger.info(f"Processing delete_pod_command for execution {event.execution_id} from saga {event.saga_id}")
     await self._handle_delete_pod_command(event)
backend/tests/integration/idempotency/test_consumer_idempotent.py (1)

82-84: Consider tightening the assertion for idempotency verification.

The assertion seen["n"] >= 1 verifies the handler ran at least once, but since two identical events are produced and the test is verifying idempotent behavior, asserting seen["n"] == 1 would more precisely confirm that the duplicate was blocked.

Suggested fix
-        assert seen["n"] >= 1
+        assert seen["n"] == 1, f"Expected exactly 1 invocation (duplicate blocked), got {seen['n']}"
backend/tests/unit/domain/events/test_event_schema_coverage.py (1)

6-6: Consider updating docstring terminology for consistency.

The docstrings still reference "Kafka event class" terminology, but after this PR's migration, all events are unified under DomainEvent. The distinction between "domain" and "Kafka" event classes is now less meaningful since they share the same BaseEvent hierarchy.

This is a minor documentation nit that could be addressed in a follow-up.

Also applies to: 52-54

backend/app/infrastructure/kafka/mappings.py (1)

102-104: Consider caching for get_event_types_for_topic.

Unlike get_event_class_for_type and get_topic_for_event, this function iterates through the entire mapping on each call. If called frequently, consider adding @lru_cache.

♻️ Optional: Add caching
+@lru_cache(maxsize=32)
 def get_event_types_for_topic(topic: KafkaTopic) -> list[EventType]:
     """Get all event types that publish to a given topic."""
     return [et for et, t in EVENT_TYPE_TO_TOPIC.items() if t == topic]
backend/app/services/kafka_event_service.py (1)

67-73: Correlation ID precedence logic is correct but could be simplified.

The logic ensures explicit correlation_id parameter takes precedence over metadata's value. Consider simplifying by always setting correlation_id on metadata creation if provided.

♻️ Optional simplification
-            event_metadata = metadata or EventMetadata(
-                service_name=self.settings.SERVICE_NAME,
-                service_version=self.settings.SERVICE_VERSION,
-                correlation_id=correlation_id or str(uuid4()),
-            )
-            if correlation_id and event_metadata.correlation_id != correlation_id:
-                event_metadata = event_metadata.model_copy(update={"correlation_id": correlation_id})
+            effective_correlation_id = correlation_id or (metadata.correlation_id if metadata else None) or str(uuid4())
+            event_metadata = metadata.model_copy(update={"correlation_id": effective_correlation_id}) if metadata else EventMetadata(
+                service_name=self.settings.SERVICE_NAME,
+                service_version=self.settings.SERVICE_VERSION,
+                correlation_id=effective_correlation_id,
+            )
backend/app/services/idempotency/middleware.py (1)

206-250: Remove unused async_handler function.

The async_handler function is defined but never used. The code path either registers dispatch_handler with the dispatcher (line 258) or logs an error (line 261) - neither path uses async_handler. This appears to be dead code left over from a previous implementation.

♻️ Proposed fix
-        # Create an async handler that processes the message
-        async def async_handler(message: Any) -> Any:
-            self.logger.info(f"IDEMPOTENT HANDLER CALLED for {event_type}")
-
-            # Extract event from confluent-kafka Message
-            if not hasattr(message, "value"):
-                self.logger.error(f"Received non-Message object for {event_type}: {type(message)}")
-                return None
-
-            # Debug log to check message details
-            self.logger.info(
-                f"Handler for {event_type} - Message type: {type(message)}, "
-                f"has key: {hasattr(message, 'key')}, "
-                f"has topic: {hasattr(message, 'topic')}"
-            )
-
-            raw_value = message.value
-
-            # Debug the raw value
-            self.logger.info(f"Raw value extracted: {raw_value[:100] if raw_value else 'None or empty'}")
-
-            # Handle tombstone messages (null value for log compaction)
-            if raw_value is None:
-                self.logger.warning(f"Received empty message for {event_type} - tombstone or consumed value")
-                return None
-
-            # Handle empty messages
-            if not raw_value:
-                self.logger.warning(f"Received empty message for {event_type} - empty bytes")
-                return None
-
-            try:
-                # Deserialize using schema registry if available
-                event = await self.consumer._schema_registry.deserialize_event(raw_value, message.topic)
-                if not event:
-                    self.logger.error(f"Failed to deserialize event for {event_type}")
-                    return None
-
-                # Call the idempotent wrapper directly in async context
-                await idempotent_wrapper(event)
-
-                self.logger.debug(f"Successfully processed {event_type} event: {event.event_id}")
-                return None
-            except Exception as e:
-                self.logger.error(f"Failed to process message for {event_type}: {e}", exc_info=True)
-                raise
-
         # Register with the dispatcher if available
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2790997 and 9606e48.

📒 Files selected for processing (92)
  • backend/app/api/routes/dlq.py
  • backend/app/api/routes/events.py
  • backend/app/api/routes/execution.py
  • backend/app/db/docs/dlq.py
  • backend/app/dlq/manager.py
  • backend/app/dlq/models.py
  • backend/app/domain/enums/events.py
  • backend/app/domain/enums/kafka.py
  • backend/app/domain/events/typed.py
  • backend/app/events/core/consumer.py
  • backend/app/events/core/dispatcher.py
  • backend/app/events/core/dlq_handler.py
  • backend/app/events/core/producer.py
  • backend/app/events/event_store.py
  • backend/app/events/event_store_consumer.py
  • backend/app/events/schema/schema_registry.py
  • backend/app/infrastructure/kafka/__init__.py
  • backend/app/infrastructure/kafka/events/__init__.py
  • backend/app/infrastructure/kafka/events/base.py
  • backend/app/infrastructure/kafka/events/execution.py
  • backend/app/infrastructure/kafka/events/metadata.py
  • backend/app/infrastructure/kafka/events/notification.py
  • backend/app/infrastructure/kafka/events/pod.py
  • backend/app/infrastructure/kafka/events/saga.py
  • backend/app/infrastructure/kafka/events/system.py
  • backend/app/infrastructure/kafka/events/user.py
  • backend/app/infrastructure/kafka/mappings.py
  • backend/app/services/coordinator/coordinator.py
  • backend/app/services/event_replay/replay_service.py
  • backend/app/services/execution_service.py
  • backend/app/services/idempotency/idempotency_manager.py
  • backend/app/services/idempotency/middleware.py
  • backend/app/services/k8s_worker/pod_builder.py
  • backend/app/services/k8s_worker/worker.py
  • backend/app/services/kafka_event_service.py
  • backend/app/services/notification_service.py
  • backend/app/services/pod_monitor/event_mapper.py
  • backend/app/services/pod_monitor/monitor.py
  • backend/app/services/result_processor/processor.py
  • backend/app/services/saga/execution_saga.py
  • backend/app/services/saga/saga_orchestrator.py
  • backend/app/services/saga/saga_step.py
  • backend/app/services/sse/kafka_redis_bridge.py
  • backend/app/services/sse/redis_bus.py
  • backend/app/services/sse/sse_shutdown_manager.py
  • backend/tests/e2e/test_k8s_worker_create_pod.py
  • backend/tests/e2e/test_resource_cleaner_orphan.py
  • backend/tests/helpers/events.py
  • backend/tests/helpers/eventually.py
  • backend/tests/helpers/kafka.py
  • backend/tests/integration/dlq/test_dlq_manager.py
  • backend/tests/integration/events/test_consume_roundtrip.py
  • backend/tests/integration/events/test_dlq_handler.py
  • backend/tests/integration/events/test_event_dispatcher.py
  • backend/tests/integration/events/test_event_store.py
  • backend/tests/integration/events/test_schema_registry_real.py
  • backend/tests/integration/idempotency/test_consumer_idempotent.py
  • backend/tests/integration/idempotency/test_decorator_idempotent.py
  • backend/tests/integration/idempotency/test_idempotency.py
  • backend/tests/integration/idempotency/test_idempotent_handler.py
  • backend/tests/integration/notifications/test_notification_sse.py
  • backend/tests/integration/result_processor/test_result_processor.py
  • backend/tests/integration/services/coordinator/test_execution_coordinator.py
  • backend/tests/integration/services/events/test_event_bus.py
  • backend/tests/integration/services/sse/test_partitioned_event_router.py
  • backend/tests/integration/services/sse/test_redis_bus.py
  • backend/tests/integration/test_notifications_routes.py
  • backend/tests/integration/test_replay_routes.py
  • backend/tests/integration/test_sse_routes.py
  • backend/tests/integration/test_user_settings_routes.py
  • backend/tests/unit/domain/events/test_event_schema_coverage.py
  • backend/tests/unit/events/test_event_dispatcher.py
  • backend/tests/unit/events/test_metadata_model.py
  • backend/tests/unit/events/test_schema_registry_manager.py
  • backend/tests/unit/services/coordinator/test_queue_manager.py
  • backend/tests/unit/services/idempotency/test_idempotency_manager.py
  • backend/tests/unit/services/idempotency/test_middleware.py
  • backend/tests/unit/services/pod_monitor/test_event_mapper.py
  • backend/tests/unit/services/pod_monitor/test_monitor.py
  • backend/tests/unit/services/saga/test_execution_saga_steps.py
  • backend/tests/unit/services/saga/test_saga_comprehensive.py
  • backend/tests/unit/services/saga/test_saga_orchestrator_unit.py
  • backend/tests/unit/services/saga/test_saga_step_and_base.py
  • backend/tests/unit/services/sse/test_kafka_redis_bridge.py
  • backend/tests/unit/services/sse/test_sse_shutdown_manager.py
  • backend/tests/unit/services/test_pod_builder.py
  • backend/workers/dlq_processor.py
  • docs/architecture/event-system-design.md
  • docs/architecture/kafka-topic-architecture.md
  • docs/architecture/model-conversion.md
  • docs/architecture/user-settings-events.md
  • docs/components/schema-manager.md
💤 Files with no reviewable changes (10)
  • backend/tests/helpers/eventually.py
  • backend/app/infrastructure/kafka/events/base.py
  • backend/app/infrastructure/kafka/events/execution.py
  • backend/app/infrastructure/kafka/events/notification.py
  • backend/app/infrastructure/kafka/events/metadata.py
  • backend/app/infrastructure/kafka/events/system.py
  • backend/app/infrastructure/kafka/events/user.py
  • backend/app/infrastructure/kafka/events/saga.py
  • backend/app/infrastructure/kafka/events/__init__.py
  • backend/app/infrastructure/kafka/events/pod.py
🧰 Additional context used
🧬 Code graph analysis (56)
backend/app/events/core/dlq_handler.py (1)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/tests/e2e/test_resource_cleaner_orphan.py (1)
backend/app/services/result_processor/resource_cleaner.py (1)
  • cleanup_orphaned_resources (149-173)
backend/app/services/idempotency/idempotency_manager.py (1)
backend/app/domain/events/typed.py (1)
  • BaseEvent (30-42)
backend/tests/integration/services/coordinator/test_execution_coordinator.py (2)
backend/app/services/coordinator/coordinator.py (1)
  • _handle_execution_requested (215-251)
backend/tests/unit/services/coordinator/test_queue_manager.py (1)
  • ev (14-15)
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)
backend/app/domain/events/typed.py (1)
  • EventMetadata (16-27)
backend/tests/unit/events/test_schema_registry_manager.py (1)
backend/app/domain/events/typed.py (1)
  • ExecutionRequestedEvent (48-62)
backend/app/domain/events/typed.py (1)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
backend/app/services/k8s_worker/pod_builder.py (1)
backend/app/domain/events/typed.py (1)
  • CreatePodCommandEvent (382-397)
backend/app/services/saga/saga_step.py (1)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/services/pod_monitor/event_mapper.py (2)
backend/app/domain/events/typed.py (7)
  • EventMetadata (16-27)
  • ExecutionCompletedEvent (95-101)
  • ExecutionFailedEvent (104-112)
  • ExecutionTimeoutEvent (115-121)
  • PodRunningEvent (149-153)
  • PodScheduledEvent (142-146)
  • PodTerminatedEvent (176-182)
backend/app/domain/execution/models.py (1)
  • ResourceUsageDomain (13-19)
backend/tests/unit/services/sse/test_sse_shutdown_manager.py (2)
backend/app/services/sse/sse_shutdown_manager.py (6)
  • initiate_shutdown (127-154)
  • unregister_connection (109-125)
  • get_shutdown_status (273-286)
  • SSEShutdownManager (22-307)
  • register_connection (80-107)
  • is_shutting_down (269-271)
backend/tests/unit/services/sse/test_sse_service.py (4)
  • unregister_connection (91-92)
  • get_shutdown_status (97-104)
  • register_connection (87-89)
  • is_shutting_down (94-95)
backend/app/services/pod_monitor/monitor.py (1)
backend/app/services/kafka_event_service.py (1)
  • publish_domain_event (177-202)
backend/tests/unit/services/coordinator/test_queue_manager.py (1)
backend/app/domain/events/typed.py (1)
  • ExecutionRequestedEvent (48-62)
backend/tests/integration/services/sse/test_partitioned_event_router.py (3)
backend/app/services/sse/redis_bus.py (1)
  • get (23-35)
backend/tests/unit/services/sse/test_sse_service.py (1)
  • get (32-41)
backend/app/schemas_pydantic/sse.py (1)
  • RedisSSEMessage (58-63)
backend/tests/integration/events/test_dlq_handler.py (1)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • SagaStartedEvent (326-331)
backend/tests/helpers/kafka.py (1)
backend/app/events/core/producer.py (2)
  • producer (56-57)
  • UnifiedProducer (23-218)
backend/app/api/routes/execution.py (1)
backend/app/domain/events/typed.py (2)
  • BaseEvent (30-42)
  • EventMetadata (16-27)
backend/tests/integration/notifications/test_notification_sse.py (3)
backend/app/services/sse/redis_bus.py (1)
  • get (23-35)
backend/tests/unit/services/sse/test_sse_service.py (1)
  • get (32-41)
backend/app/schemas_pydantic/sse.py (1)
  • RedisNotificationMessage (91-101)
backend/tests/integration/idempotency/test_idempotency.py (2)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/services/idempotency/middleware.py (1)
  • idempotent_handler (93-119)
backend/tests/unit/services/idempotency/test_idempotency_manager.py (1)
backend/app/domain/events/typed.py (1)
  • BaseEvent (30-42)
backend/tests/unit/events/test_event_dispatcher.py (2)
backend/app/events/core/dispatcher.py (1)
  • EventDispatcher (15-177)
backend/tests/helpers/events.py (1)
  • make_execution_requested_event (7-50)
backend/app/services/k8s_worker/worker.py (1)
backend/app/domain/events/typed.py (5)
  • CreatePodCommandEvent (382-397)
  • DeletePodCommandEvent (400-406)
  • ExecutionFailedEvent (104-112)
  • ExecutionStartedEvent (80-85)
  • PodCreatedEvent (135-139)
backend/app/events/event_store.py (3)
backend/tests/unit/services/pod_monitor/test_monitor.py (1)
  • store_event (53-55)
backend/app/db/repositories/event_repository.py (2)
  • store_event (39-52)
  • get_event (68-72)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/tests/integration/events/test_event_store.py (3)
backend/tests/conftest.py (1)
  • app (79-91)
backend/tests/unit/conftest.py (1)
  • app (62-63)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/services/saga/saga_orchestrator.py (2)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • SagaCancelledEvent (351-360)
backend/app/services/saga/base_saga.py (1)
  • BaseSaga (8-52)
backend/tests/integration/test_sse_routes.py (1)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • PodCreatedEvent (135-139)
backend/app/services/notification_service.py (1)
backend/app/domain/events/typed.py (3)
  • ExecutionCompletedEvent (95-101)
  • ExecutionFailedEvent (104-112)
  • ExecutionTimeoutEvent (115-121)
backend/tests/unit/services/saga/test_execution_saga_steps.py (1)
backend/app/domain/events/typed.py (1)
  • ExecutionRequestedEvent (48-62)
backend/tests/unit/events/test_metadata_model.py (2)
backend/app/domain/events/typed.py (1)
  • EventMetadata (16-27)
backend/app/services/coordinator/queue_manager.py (1)
  • user_id (34-35)
backend/app/api/routes/dlq.py (3)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/tests/integration/services/sse/test_redis_bus.py (1)
  • model_dump (27-28)
backend/tests/unit/services/sse/test_kafka_redis_bridge.py (1)
  • model_dump (41-42)
backend/tests/integration/idempotency/test_consumer_idempotent.py (4)
backend/app/events/core/dispatcher.py (2)
  • EventDispatcher (15-177)
  • register (36-57)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
backend/tests/helpers/events.py (1)
  • make_execution_requested_event (7-50)
backend/app/events/core/producer.py (2)
  • producer (56-57)
  • produce (111-148)
backend/app/services/coordinator/coordinator.py (6)
backend/app/domain/events/typed.py (4)
  • CreatePodCommandEvent (382-397)
  • EventMetadata (16-27)
  • ExecutionAcceptedEvent (65-70)
  • ExecutionRequestedEvent (48-62)
backend/app/events/core/dispatcher.py (1)
  • EventDispatcher (15-177)
backend/app/events/core/consumer.py (1)
  • UnifiedConsumer (23-256)
backend/app/events/core/producer.py (1)
  • UnifiedProducer (23-218)
backend/app/events/event_store.py (1)
  • EventStore (19-314)
backend/app/events/schema/schema_registry.py (1)
  • SchemaRegistryManager (35-119)
backend/tests/unit/services/saga/test_saga_step_and_base.py (2)
backend/app/domain/events/typed.py (2)
  • BaseEvent (30-42)
  • EventMetadata (16-27)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
backend/app/services/execution_service.py (1)
backend/app/domain/events/typed.py (3)
  • EventMetadata (16-27)
  • ExecutionCancelledEvent (124-129)
  • ExecutionRequestedEvent (48-62)
backend/tests/e2e/test_k8s_worker_create_pod.py (1)
backend/app/domain/events/typed.py (2)
  • CreatePodCommandEvent (382-397)
  • EventMetadata (16-27)
backend/app/services/idempotency/middleware.py (3)
backend/tests/unit/events/test_event_dispatcher.py (2)
  • handler (37-38)
  • handler (48-49)
backend/app/services/idempotency/idempotency_manager.py (1)
  • IdempotencyManager (69-315)
backend/tests/integration/idempotency/test_idempotency.py (1)
  • on_duplicate (351-352)
backend/app/events/event_store_consumer.py (2)
backend/app/services/saga/saga_orchestrator.py (1)
  • _handle_event (151-170)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/services/result_processor/processor.py (4)
backend/app/domain/events/typed.py (4)
  • EventMetadata (16-27)
  • ExecutionCompletedEvent (95-101)
  • ExecutionFailedEvent (104-112)
  • ExecutionTimeoutEvent (115-121)
backend/app/domain/execution/exceptions.py (1)
  • ExecutionNotFoundError (4-8)
backend/app/domain/execution/models.py (1)
  • ExecutionResultDomain (40-51)
backend/app/events/core/dispatcher.py (1)
  • EventDispatcher (15-177)
backend/tests/helpers/events.py (1)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • ExecutionRequestedEvent (48-62)
backend/app/events/core/producer.py (1)
backend/app/events/schema/schema_registry.py (1)
  • serialize_event (70-78)
backend/tests/integration/result_processor/test_result_processor.py (2)
backend/app/domain/events/typed.py (3)
  • EventMetadata (16-27)
  • ExecutionCompletedEvent (95-101)
  • ResultStoredEvent (195-200)
backend/app/domain/execution/models.py (1)
  • ResourceUsageDomain (13-19)
backend/app/dlq/models.py (2)
backend/app/core/utils.py (1)
  • StringEnum (6-31)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
backend/tests/integration/services/events/test_event_bus.py (1)
backend/app/services/event_bus.py (2)
  • EventBusEvent (21-29)
  • subscribe (162-188)
backend/app/db/docs/dlq.py (2)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/dlq/models.py (1)
  • DLQMessageStatus (12-18)
backend/tests/unit/services/test_pod_builder.py (2)
backend/app/domain/events/typed.py (2)
  • CreatePodCommandEvent (382-397)
  • EventMetadata (16-27)
backend/app/services/coordinator/queue_manager.py (1)
  • user_id (34-35)
backend/app/api/routes/events.py (1)
backend/app/domain/events/typed.py (2)
  • BaseEvent (30-42)
  • EventMetadata (16-27)
backend/tests/integration/services/sse/test_redis_bus.py (2)
backend/app/domain/events/typed.py (1)
  • EventMetadata (16-27)
backend/tests/unit/services/sse/test_kafka_redis_bridge.py (1)
  • _make_metadata (32-33)
backend/tests/integration/dlq/test_dlq_manager.py (3)
backend/app/dlq/manager.py (1)
  • create_dlq_manager (443-480)
backend/app/domain/enums/kafka.py (1)
  • KafkaTopic (7-54)
backend/app/domain/events/typed.py (1)
  • DLQMessageReceivedEvent (522-532)
backend/tests/unit/services/saga/test_saga_comprehensive.py (2)
backend/app/domain/events/typed.py (1)
  • ExecutionRequestedEvent (48-62)
backend/app/services/saga/saga_step.py (3)
  • SagaStep (73-100)
  • execute (80-88)
  • SagaContext (14-70)
backend/app/services/kafka_event_service.py (4)
backend/app/domain/events/typed.py (1)
  • EventMetadata (16-27)
backend/app/db/repositories/event_repository.py (1)
  • store_event (39-52)
backend/app/events/core/producer.py (2)
  • produce (111-148)
  • metrics (52-53)
backend/app/core/tracing/utils.py (1)
  • inject_trace_context (88-100)
backend/app/events/schema/schema_registry.py (1)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
backend/tests/unit/services/pod_monitor/test_monitor.py (2)
backend/app/domain/events/typed.py (3)
  • EventMetadata (16-27)
  • ExecutionCompletedEvent (95-101)
  • ExecutionStartedEvent (80-85)
backend/app/events/core/producer.py (1)
  • produce (111-148)
backend/tests/integration/idempotency/test_decorator_idempotent.py (1)
backend/app/services/idempotency/middleware.py (1)
  • idempotent_handler (93-119)
backend/tests/unit/domain/events/test_event_schema_coverage.py (1)
backend/app/domain/events/typed.py (1)
  • BaseEvent (30-42)
backend/tests/integration/events/test_schema_registry_real.py (1)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • PodCreatedEvent (135-139)
backend/app/events/core/dispatcher.py (2)
backend/app/infrastructure/kafka/mappings.py (1)
  • get_event_class_for_type (91-93)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
🪛 GitHub Actions: docs
backend/app/dlq/manager.py

[error] 26-26: ModuleNotFoundError: No module named 'app.infrastructure.kafka.events' while importing DLQManager. Step './deploy.sh openapi' failed with exit code 1.

🪛 GitHub Actions: MyPy Type Checking
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py

[error] 30-30: Class cannot subclass "BaseEvent" (has type "Any")

backend/workers/dlq_processor.py

[error] 69-69: Returning Any from function declared to return "bool"


[error] 69-69: "DLQMessage" has no attribute "age_seconds"

backend/tests/unit/services/sse/test_kafka_redis_bridge.py

[error] 36-36: Class cannot subclass "BaseEvent" (has type "Any")

backend/app/services/notification_service.py

[error] 632-632: Item "None" of "ResourceUsageDomain | None" has no attribute "execution_time_wall_seconds"

backend/tests/unit/services/saga/test_saga_step_and_base.py

[error] 50-50: Argument 1 to "add_event" of "SagaContext" has incompatible type "BaseEvent"; expected one of ExecutionRequestedEvent | ExecutionAcceptedEvent | ExecutionQueuedEvent | ExecutionStartedEvent | ExecutionRunningEvent | <51 more items>


[error] 75-75: Type argument "BaseEvent" of "SagaStep" must be a subtype of "ExecutionRequestedEvent | ExecutionAcceptedEvent | ExecutionQueuedEvent | ExecutionStartedEvent | ExecutionRunningEvent | <51 more items>"


[error] 82-82: Type argument "BaseEvent" of "SagaStep" must be a subtype of "ExecutionRequestedEvent | ExecutionAcceptedEvent | ExecutionQueuedEvent | ExecutionStartedEvent | ExecutionRunningEvent | <51 more items>"

backend/app/services/result_processor/processor.py

[error] 178-178: Item "None" of "ResourceUsageDomain | None" has no attribute "execution_time_wall_seconds"


[error] 182-182: Item "None" of "ResourceUsageDomain | None" has no attribute "peak_memory_kb"


[error] 218-218: Argument 1 to "record_error" of "ExecutionMetrics" has incompatible type "ExecutionErrorType | None"; expected "str"

backend/tests/integration/services/sse/test_redis_bus.py

[error] 21-21: Class cannot subclass "BaseEvent" (has type "Any")

backend/tests/integration/dlq/test_dlq_manager.py

[error] 64-64: "ExecutionRequestedEvent" has no attribute "to_dict"

backend/tests/integration/events/test_schema_registry_real.py

[error] 24-24: "PodCreatedEvent" has no attribute "topic"

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Scan Backend
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: E2E Tests


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
backend/tests/integration/services/sse/test_redis_bus.py (1)

21-28: Fix type incompatibility causing pipeline failure.

The MyPy error indicates _DummyEvent is not part of the DomainEvent union expected by publish_event. After the unification, only specific domain event types are accepted.

Two options to fix:

  1. Use a real event type (preferred): Import and use an actual event class like ExecutionCompletedEvent instead of _DummyEvent.

  2. Cast for test purposes: If the dummy event structure is intentional for isolation, cast at the call site.

Option 1: Use a real event type
-from app.domain.events.typed import EventMetadata
-from app.domain.events.typed import BaseEvent
+from app.domain.events.typed import EventMetadata, ExecutionCompletedEvent

Then replace the _DummyEvent usage in the test:

-class _DummyEvent(BaseEvent):
-    """Dummy event for testing."""
-    execution_id: str = ""
-    status: str | None = None
-    topic: ClassVar[KafkaTopic] = KafkaTopic.EXECUTION_EVENTS
-
-    def model_dump(self, **kwargs: object) -> dict[str, Any]:
-        return {"execution_id": self.execution_id, "status": self.status}
     # Publish event
-    evt = _DummyEvent(
-        event_type=EventType.EXECUTION_COMPLETED,
-        metadata=_make_metadata(),
-        execution_id="exec-1",
-        status="completed"
-    )
+    evt = ExecutionCompletedEvent(
+        metadata=_make_metadata(),
+        execution_id="exec-1",
+    )
Option 2: Cast for test isolation
+from typing import cast
+from app.domain.events.typed import DomainEvent
-    await bus.publish_event("exec-1", evt)
+    await bus.publish_event("exec-1", cast(DomainEvent, evt))

Also applies to: 90-96

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)

30-44: Fix MyPy type errors: _FakeEvent is not compatible with expected DomainEvent union.

The pipeline reports 4 MyPy errors because _FakeEvent is not part of the DomainEvent type union expected by SagaStep and SagaOrchestrator methods. The type system now enforces that only registered domain event types can be used.

Options to resolve:

  1. Use an existing event type (e.g., ExecutionRequestedEvent) instead of a custom fake event
  2. Add # type: ignore comments on the affected lines if test flexibility is more important than strict typing
  3. Cast the fake event where needed: cast(DomainEvent, _make_event(...))
Option 2: Add type ignore comments
-class _StepOK(SagaStep[_FakeEvent]):
+class _StepOK(SagaStep[_FakeEvent]):  # type: ignore[type-arg]
-    def get_steps(self) -> list[SagaStep[_FakeEvent]]:
+    def get_steps(self) -> list[SagaStep[_FakeEvent]]:  # type: ignore[type-arg]
-    await orch._handle_event(_make_event(EventType.EXECUTION_REQUESTED, "e"))
+    await orch._handle_event(_make_event(EventType.EXECUTION_REQUESTED, "e"))  # type: ignore[arg-type]
-    assert orch._should_trigger_saga(_Saga, _make_event(EventType.EXECUTION_REQUESTED, "e")) is True
+    assert orch._should_trigger_saga(_Saga, _make_event(EventType.EXECUTION_REQUESTED, "e")) is True  # type: ignore[arg-type]
backend/tests/integration/dlq/test_dlq_manager.py (1)

63-70: Fix: to_dict() method does not exist on domain events.

The pipeline failure indicates ExecutionRequestedEvent has no attribute to_dict. Domain events now use Pydantic's model_dump() method.

Proposed fix
     payload = {
-        "event": ev.to_dict(),
+        "event": ev.model_dump(mode="json"),
         "original_topic": f"{prefix}{str(KafkaTopic.EXECUTION_EVENTS)}",
         "error": "handler failed",
         "retry_count": 0,
         "failed_at": datetime.now(timezone.utc).isoformat(),
         "producer_id": "tests",
     }
backend/app/dlq/manager.py (2)

1-32: Fix: Import block is un-sorted.

The Ruff linter reports unsorted imports. The app.domain.events.typed import should be grouped with other app.* imports in alphabetical order.

Run ruff check --fix to auto-fix, or manually reorder:

Suggested import ordering
 from app.core.lifecycle import LifecycleEnabled
 from app.core.metrics.context import get_dlq_metrics
 from app.core.tracing import EventAttributes
 from app.core.tracing.utils import extract_trace_context, get_tracer, inject_trace_context
 from app.db.docs import DLQMessageDocument
 from app.dlq.models import (
     DLQBatchRetryResult,
     DLQMessage,
     DLQMessageStatus,
     DLQMessageUpdate,
     DLQRetryResult,
     RetryPolicy,
     RetryStrategy,
 )
 from app.domain.enums.kafka import GroupId, KafkaTopic
-from app.events.schema.schema_registry import SchemaRegistryManager
 from app.domain.events.typed import (
     DLQMessageDiscardedEvent,
     DLQMessageReceivedEvent,
     DLQMessageRetriedEvent,
     EventMetadata,
 )
+from app.events.schema.schema_registry import SchemaRegistryManager
 from app.settings import Settings

206-255: Non-atomic dual-topic send could leave inconsistent state.

The method sends to retry_topic then original_topic sequentially. If the second send_and_wait fails after the first succeeds, the message appears in the retry topic but not the original, and the status update won't run.

Consider wrapping in a try/except to handle partial failure, or at minimum updating status to reflect the partial state.

Suggested improvement
+        # Send to both topics - if second fails, we still want to track partial retry
+        sent_to_retry = False
+        sent_to_original = False
         # Send to retry topic
-        await self.producer.send_and_wait(
-            topic=retry_topic,
-            value=json.dumps(event.model_dump(mode="json")).encode(),
-            key=message.event.event_id.encode(),
-            headers=kafka_headers,
-        )
-
-        # Send to original topic
-        await self.producer.send_and_wait(
-            topic=message.original_topic,
-            value=json.dumps(event.model_dump(mode="json")).encode(),
-            key=message.event.event_id.encode(),
-            headers=kafka_headers,
-        )
+        try:
+            await self.producer.send_and_wait(
+                topic=retry_topic,
+                value=json.dumps(event.model_dump(mode="json")).encode(),
+                key=message.event.event_id.encode(),
+                headers=kafka_headers,
+            )
+            sent_to_retry = True
+
+            await self.producer.send_and_wait(
+                topic=message.original_topic,
+                value=json.dumps(event.model_dump(mode="json")).encode(),
+                key=message.event.event_id.encode(),
+                headers=kafka_headers,
+            )
+            sent_to_original = True
+        except Exception as e:
+            self.logger.error(
+                f"Partial retry failure: retry_topic={sent_to_retry}, original={sent_to_original}, error={e}",
+                extra={"event_id": message.event.event_id},
+            )
+            raise
🧹 Nitpick comments (4)
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)

12-17: Consider consolidating imports from the same module.

Both EventMetadata (line 12) and BaseEvent (line 17) are imported from app.domain.events.typed. These can be combined into a single import statement.

Suggested fix
-from app.domain.events.typed import EventMetadata
 from app.domain.saga.models import Saga, SagaConfig
 from app.events.core import UnifiedProducer
 from app.events.event_store import EventStore
 from app.events.schema.schema_registry import SchemaRegistryManager
-from app.domain.events.typed import BaseEvent
+from app.domain.events.typed import BaseEvent, EventMetadata
backend/tests/integration/dlq/test_dlq_manager.py (1)

48-61: Consider logging more context on deserialization failures.

The exception handler only logs at debug level, which could hide issues during test debugging. For test code, consider logging at warning level or including the raw message value.

Suggested improvement
             except Exception as e:
-                _test_logger.debug(f"Error deserializing DLQ event: {e}")
+                _test_logger.warning(f"Error deserializing DLQ event: {e}, raw: {msg.value[:100] if msg.value else None}")
backend/app/dlq/manager.py (2)

72-76: Consider defensive handling for header decoding.

If a header value contains non-UTF-8 bytes, the .decode() call will raise UnicodeDecodeError. Consider using errors="replace" or errors="ignore" for resilience.

Suggested fix
     def _kafka_msg_to_message(self, msg: Any) -> DLQMessage:
         """Parse Kafka ConsumerRecord into DLQMessage."""
         data = json.loads(msg.value)
-        headers = {k: v.decode() for k, v in (msg.headers or [])}
+        headers = {k: v.decode(errors="replace") for k, v in (msg.headers or [])}
         return DLQMessage(**data, dlq_offset=msg.offset, dlq_partition=msg.partition, headers=headers)

107-138: Span attribute uses enum value instead of full topic name.

Line 125 sets EventAttributes.KAFKA_TOPIC to self.dlq_topic, which is the KafkaTopic enum value rather than the prefixed topic name. For consistency with how topics are referenced elsewhere, consider using the full prefixed topic name.

Suggested fix
                 with get_tracer().start_as_current_span(
                     name="dlq.consume",
                     context=ctx,
                     kind=SpanKind.CONSUMER,
                     attributes={
-                        EventAttributes.KAFKA_TOPIC: self.dlq_topic,
+                        EventAttributes.KAFKA_TOPIC: f"{self.settings.KAFKA_TOPIC_PREFIX}{self.dlq_topic}",
                         EventAttributes.EVENT_TYPE: dlq_msg.event.event_type,
                         EventAttributes.EVENT_ID: dlq_msg.event.event_id,
                     },
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9606e48 and 8ab9e31.

📒 Files selected for processing (7)
  • backend/app/dlq/manager.py
  • backend/app/services/coordinator/queue_manager.py
  • backend/tests/integration/dlq/test_dlq_manager.py
  • backend/tests/integration/services/sse/test_redis_bus.py
  • backend/tests/unit/services/saga/test_execution_saga_steps.py
  • backend/tests/unit/services/saga/test_saga_orchestrator_unit.py
  • backend/tests/unit/services/sse/test_kafka_redis_bridge.py
✅ Files skipped from review due to trivial changes (1)
  • backend/app/services/coordinator/queue_manager.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/tests/unit/services/sse/test_kafka_redis_bridge.py
  • backend/tests/unit/services/saga/test_execution_saga_steps.py
🧰 Additional context used
🧬 Code graph analysis (3)
backend/tests/integration/services/sse/test_redis_bus.py (2)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • BaseEvent (30-42)
backend/tests/unit/services/sse/test_kafka_redis_bridge.py (1)
  • _make_metadata (32-33)
backend/tests/integration/dlq/test_dlq_manager.py (5)
backend/app/dlq/manager.py (1)
  • create_dlq_manager (443-480)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
backend/app/domain/enums/kafka.py (1)
  • KafkaTopic (7-54)
backend/app/events/schema/schema_registry.py (2)
  • SchemaRegistryManager (35-119)
  • deserialize_event (80-95)
backend/app/domain/events/typed.py (1)
  • DLQMessageReceivedEvent (522-532)
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • BaseEvent (30-42)
🪛 GitHub Actions: MyPy Type Checking
backend/tests/integration/services/sse/test_redis_bus.py

[error] 96-96: Argument 2 to "publish_event" of "SSERedisBus" has incompatible type "_DummyEvent"; expected a union of ExecutionRequestedEvent | ExecutionAcceptedEvent | ExecutionQueuedEvent | ExecutionStartedEvent | ExecutionRunningEvent | <51 more items>

backend/tests/integration/dlq/test_dlq_manager.py

[error] 64-64: "ExecutionRequestedEvent" has no attribute "to_dict".

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py

[error] 97-97: Type argument "_FakeEvent" of "SagaStep" must be a subtype of ExecutionRequestedEvent | ExecutionAcceptedEvent | ExecutionQueuedEvent | ExecutionStartedEvent | ExecutionRunningEvent | <51 more items>.


[error] 117-117: Type argument "_FakeEvent" of "SagaStep" must be a subtype of ExecutionRequestedEvent | ExecutionAcceptedEvent | ExecutionQueuedEvent | ExecutionStartedEvent | ExecutionRunningEvent | <51 more items>.


[error] 145-145: Argument 1 to "_handle_event" of "SagaOrchestrator" has incompatible type "_FakeEvent"; expected one of ExecutionRequestedEvent | ExecutionAcceptedEvent | ExecutionQueuedEvent | ExecutionStartedEvent | ExecutionRunningEvent | <51 more items>.


[error] 165-165: Argument 2 to "_should_trigger_saga" of "SagaOrchestrator" has incompatible type "_FakeEvent"; expected one of ExecutionRequestedEvent | ExecutionAcceptedEvent | ExecutionQueuedEvent | ExecutionStartedEvent | ExecutionRunningEvent | <51 more items>.

🪛 GitHub Actions: Ruff Linting
backend/app/dlq/manager.py

[error] 1-32: Ruff check failed: Import block is un-sorted or un-formatted. Found 1 error fixable with the --fix option. Tool: Ruff (Python linter). Run 'ruff --fix' to auto-fix imports.

🔇 Additional comments (10)
backend/tests/integration/services/sse/test_redis_bus.py (2)

11-12: LGTM on import migration.

The imports are correctly updated to use the unified domain event model (EventMetadata and BaseEvent from app.domain.events.typed), consistent with the PR's objective to consolidate the event systems.


75-76: LGTM!

The metadata factory correctly uses EventMetadata with the required fields, matching the pattern used in other test files (e.g., test_kafka_redis_bridge.py).

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (1)

40-42: LGTM!

The metadata field correctly uses the new EventMetadata type with appropriate default values for testing. The migration from AvroEventMetadata to EventMetadata aligns with the PR objectives.

backend/tests/integration/dlq/test_dlq_manager.py (2)

1-14: Imports look correct for the new unified event model.

The imports properly reference the new domain event types (EventType, DLQMessageReceivedEvent) from the unified app.domain packages, aligning with the PR's migration objectives.


88-103: LGTM! Solid async test structure with proper cleanup.

The test correctly:

  • Starts the consumer before the manager to ensure message capture
  • Uses asyncio.wait_for with timeout to prevent hangs
  • Properly cancels and cleans up the consumer task in the finally block
  • Validates key fields of the emitted DLQMessageReceivedEvent
backend/app/dlq/manager.py (5)

170-183: Event emission failure is silently logged, not propagated.

The _emit_message_received_event call at line 183 will swallow exceptions (as seen in _produce_dlq_event). This is reasonable for telemetry events, but consider whether failed emission should trigger any compensating action or metric.

The design choice to log-and-continue for emission failures is appropriate—DLQ storage is the primary concern, and emission is secondary telemetry.


322-372: LGTM! Clean event emission pattern.

The three emission methods follow a consistent pattern:

  • Create typed event with appropriate fields
  • Delegate to _produce_dlq_event for serialization and send
  • Silent failure with logging is appropriate for telemetry events

The use of schema_registry.serialize_event ensures proper Avro serialization aligned with the unified event model.


275-304: LGTM! Monitor loop handles scheduled retries appropriately.

The monitor:

  • Queries with sensible limit (100 messages)
  • Processes individually, allowing partial progress
  • Has different sleep intervals for normal (10s) vs error (60s) states
  • Failed retries remain SCHEDULED for next iteration

374-440: LGTM! Manual operations have proper state guards.

Both retry_message_manually and discard_message_manually:

  • Correctly query using nested {"event.event_id": event_id} path
  • Guard against terminal states (DISCARDED, RETRIED)
  • Return boolean success indicator

The batch retry properly aggregates results without short-circuiting on individual failures.


443-480: LGTM! Factory creates well-configured Kafka clients.

Producer configuration includes:

  • acks="all" for durability
  • enable_idempotence=True for exactly-once semantics
  • Compression and batching for efficiency

Consumer configuration properly uses manual commit with earliest offset reset.
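
For reference, a minimal aiokafka sketch of this configuration; the bootstrap server, topic, and group id below are placeholders, not the project's settings:

from aiokafka import AIOKafkaConsumer, AIOKafkaProducer

# Durable, idempotent producer with compression and a small batching window.
producer = AIOKafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder
    acks="all",                          # wait for all in-sync replicas
    enable_idempotence=True,             # exactly-once on the produce side
    compression_type="gzip",
    linger_ms=10,
)

# Consumer with manual commits, starting from the earliest uncommitted offset.
consumer = AIOKafkaConsumer(
    "dlq-events",                        # placeholder topic
    bootstrap_servers="localhost:9092",  # placeholder
    group_id="dlq-manager",              # placeholder
    enable_auto_commit=False,            # commit only after successful processing
    auto_offset_reset="earliest",
)

Both clients still need to be started and stopped inside a running event loop.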


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 14 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/tests/integration/db/repositories/test_dlq_repository.py">

<violation number="1" location="backend/tests/integration/db/repositories/test_dlq_repository.py:89">
P1: Test will fail: `msg.event.event_id` will be a random UUID, not `"id1"`. The test data setup passes `event_id="id1"` as a top-level field (which the new model ignores), but the `event` dict doesn't include `event_id`. The test data needs to include `"event_id": "id1"` inside each `event` dict.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/tests/integration/db/repositories/test_dlq_repository.py (1)

24-33: Test assertion on line 89 is broken and will fail.

The assertion msg.event.event_id == "id1" has two critical issues:

  1. Incorrect dict access syntax: The event field is a dict[str, Any], so accessing it with attribute notation (msg.event.event_id) raises AttributeError. Use dict key access instead: msg.event["event_id"] or msg.event.get("event_id").

  2. Missing event_id in fixture data: Even with correct access syntax, the event dicts in test fixtures (lines 28-33, 44-49, 60-65) do not contain the event_id key.

Add "event_id" to each event dict in the fixtures, and fix the assertion to use proper dict key access.

backend/workers/dlq_processor.py (1)

76-107: DLQ manager is never started - processing will not occur.

The main() function initializes the manager and configures policies/filters, but it never calls manager.start() (or uses async with manager). The function just waits on stop_event indefinitely without starting any processing. The DLQManager inherits from LifecycleEnabled and requires explicit startup.

🐛 Proposed fix
     async with AsyncExitStack() as stack:
         stack.push_async_callback(container.close)
+        await manager.start()
+        stack.push_async_callback(manager.stop)
         await stop_event.wait()

Alternatively, use the context manager pattern:

     async with AsyncExitStack() as stack:
         stack.push_async_callback(container.close)
+        await stack.enter_async_context(manager)
         await stop_event.wait()
🧹 Nitpick comments (8)
backend/app/services/result_processor/processor.py (1)

155-165: Consider using runtime checks instead of assert for type narrowing.

The assert isinstance(...) pattern works for type narrowing, but assert statements are removed when Python runs with optimization flags (-O). If the dispatcher ever routes an incorrect event type in production with optimizations enabled, these assertions would be silently skipped, leading to attribute errors downstream.

♻️ Suggested alternative using explicit checks
     async def _handle_completed_wrapper(self, event: DomainEvent) -> None:
-        assert isinstance(event, ExecutionCompletedEvent)
+        if not isinstance(event, ExecutionCompletedEvent):
+            raise TypeError(f"Expected ExecutionCompletedEvent, got {type(event).__name__}")
         await self._handle_completed(event)

     async def _handle_failed_wrapper(self, event: DomainEvent) -> None:
-        assert isinstance(event, ExecutionFailedEvent)
+        if not isinstance(event, ExecutionFailedEvent):
+            raise TypeError(f"Expected ExecutionFailedEvent, got {type(event).__name__}")
         await self._handle_failed(event)

     async def _handle_timeout_wrapper(self, event: DomainEvent) -> None:
-        assert isinstance(event, ExecutionTimeoutEvent)
+        if not isinstance(event, ExecutionTimeoutEvent):
+            raise TypeError(f"Expected ExecutionTimeoutEvent, got {type(event).__name__}")
         await self._handle_timeout(event)
backend/tests/integration/db/repositories/test_dlq_repository.py (1)

90-91: Assertions always pass regardless of return value.

assert await repo.mark_message_retried("id1") in (True, False) will always pass since any boolean is in (True, False). If the intent is to verify these methods don't raise exceptions, consider making that explicit. If you expect specific outcomes (e.g., success on first call), assert those values directly.

💡 Suggested improvement
-    assert await repo.mark_message_retried("id1") in (True, False)
-    assert await repo.mark_message_discarded("id1", "r") in (True, False)
+    # Verify methods execute without error; first call should succeed
+    result_retried = await repo.mark_message_retried("id1")
+    assert result_retried is True, "Expected successful retry mark"
+    result_discarded = await repo.mark_message_discarded("id1", "r")
+    # After retry, discard may or may not succeed depending on business logic
+    assert isinstance(result_discarded, bool)
backend/tests/integration/dlq/test_dlq_manager.py (1)

48-61: Silent exception handling may hide deserialization issues.

The consume_dlq_events function catches all exceptions and logs at debug level, which could mask real problems during test execution. Consider logging at a higher level or re-raising after a certain number of failures.

♻️ Suggested improvement
             except Exception as e:
-                _test_logger.debug(f"Error deserializing DLQ event: {e}")
+                _test_logger.warning(f"Error deserializing DLQ event: {e}")
backend/app/dlq/manager.py (5)

107-138: Message commit occurs even if storage or event emission fails.

In _process_messages, the commit happens after _process_dlq_message returns, but _process_dlq_message could partially succeed (e.g., store the message but fail during event emission). If the emit fails silently (caught in _produce_dlq_event), the message is committed but the received event was never emitted. Consider whether this is the desired behavior.

Also, if an exception occurs in _process_dlq_message, the message is not committed (good), but the loop continues to the next message without any backoff, which could cause tight-loop failures.

♻️ Consider adding backoff on repeated failures
             except Exception as e:
                 self.logger.error(f"Error processing DLQ message: {e}")
+                await asyncio.sleep(1)  # Brief backoff to prevent tight-loop failures

170-183: Upsert pattern may have race condition.

The _store_message method performs a find-then-save pattern which is not atomic. If two instances process the same message concurrently, both could find no existing document and attempt to insert, potentially causing duplicate key errors (though the unique index on event.event_id would prevent actual duplicates).

Consider using upsert=True with find_one_and_update or catching the duplicate key exception.

♻️ Consider using Beanie's upsert capability
-    existing = await DLQMessageDocument.find_one({"event.event_id": message.event.event_id})
-    if existing:
-        doc.id = existing.id
-    await doc.save()
+    await DLQMessageDocument.find_one(
+        {"event.event_id": message.event.event_id}
+    ).upsert(
+        Set(message.model_dump()),
+        on_insert=doc
+    )

Alternatively, catch DuplicateKeyError and update:

try:
    await doc.insert()
except DuplicateKeyError:
    existing = await DLQMessageDocument.find_one({"event.event_id": message.event.event_id})
    if existing:
        doc.id = existing.id
        await doc.save()

360-372: DLQ event emission failures are silently logged - consider retry or metrics.

The _produce_dlq_event method catches all exceptions and only logs the error. For observability events, this is acceptable, but you lose visibility into emission failures. Consider:

  1. Recording a metric for emission failures
  2. Adding structured logging with event details
♻️ Add metrics for emission failures
         except Exception as e:
             self.logger.error(f"Failed to emit DLQ event {event.event_type}: {e}")
+            self.metrics.record_dlq_event_emission_failure(str(event.event_type))

389-416: Batch retry processes sequentially - consider concurrency for large batches.

The retry_messages_batch method processes each event ID sequentially. For large batches, this could be slow. Consider using asyncio.gather with a concurrency limit for better throughput.

♻️ Concurrent batch processing with semaphore
async def retry_messages_batch(self, event_ids: list[str]) -> DLQBatchRetryResult:
    semaphore = asyncio.Semaphore(10)  # Limit concurrency
    
    async def retry_with_limit(event_id: str) -> DLQRetryResult:
        async with semaphore:
            try:
                success = await self.retry_message_manually(event_id)
                return DLQRetryResult(
                    event_id=event_id,
                    status="success" if success else "failed",
                    error=None if success else "Retry failed"
                )
            except Exception as e:
                self.logger.error(f"Error retrying message {event_id}: {e}")
                return DLQRetryResult(event_id=event_id, status="failed", error=str(e))
    
    results = await asyncio.gather(*[retry_with_limit(eid) for eid in event_ids])
    successful = sum(1 for r in results if r.status == "success")
    return DLQBatchRetryResult(
        total=len(event_ids),
        successful=successful,
        failed=len(event_ids) - successful,
        details=list(results)
    )

69-70: Hardcoded service metadata - consider making configurable.

The _event_metadata uses hardcoded values for service_name and service_version. Consider deriving these from settings or environment variables for better maintainability across deployments.

♻️ Use settings for service metadata
-        self._event_metadata = EventMetadata(service_name="dlq-manager", service_version="1.0.0")
+        self._event_metadata = EventMetadata(
+            service_name=getattr(settings, "SERVICE_NAME", "dlq-manager"),
+            service_version=getattr(settings, "SERVICE_VERSION", "1.0.0")
+        )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab9e31 and 62473fd.

📒 Files selected for processing (14)
  • backend/app/dlq/manager.py
  • backend/app/services/notification_service.py
  • backend/app/services/result_processor/processor.py
  • backend/tests/integration/db/repositories/test_dlq_repository.py
  • backend/tests/integration/dlq/test_dlq_manager.py
  • backend/tests/integration/events/test_producer_roundtrip.py
  • backend/tests/integration/events/test_schema_registry_real.py
  • backend/tests/integration/events/test_schema_registry_roundtrip.py
  • backend/tests/integration/services/sse/test_redis_bus.py
  • backend/tests/unit/services/saga/test_execution_saga_steps.py
  • backend/tests/unit/services/saga/test_saga_orchestrator_unit.py
  • backend/tests/unit/services/saga/test_saga_step_and_base.py
  • backend/tests/unit/services/sse/test_kafka_redis_bridge.py
  • backend/workers/dlq_processor.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/integration/services/sse/test_redis_bus.py
🧰 Additional context used
🧬 Code graph analysis (12)
backend/tests/integration/db/repositories/test_dlq_repository.py (1)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/services/result_processor/processor.py (5)
backend/app/domain/events/typed.py (4)
  • EventMetadata (16-27)
  • ExecutionCompletedEvent (95-101)
  • ExecutionFailedEvent (104-112)
  • ExecutionTimeoutEvent (115-121)
backend/app/domain/execution/models.py (1)
  • ExecutionResultDomain (40-51)
backend/app/events/core/dispatcher.py (1)
  • EventDispatcher (15-177)
backend/app/events/schema/schema_registry.py (1)
  • SchemaRegistryManager (35-119)
backend/app/core/metrics/execution.py (3)
  • record_execution_duration (68-69)
  • record_memory_usage (77-78)
  • record_error (80-81)
backend/tests/unit/services/saga/test_saga_step_and_base.py (2)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • SystemErrorEvent (497-502)
backend/app/services/saga/saga_step.py (4)
  • SagaStep (73-100)
  • execute (80-88)
  • SagaContext (14-70)
  • can_execute (95-97)
backend/app/services/notification_service.py (1)
backend/app/domain/events/typed.py (3)
  • ExecutionCompletedEvent (95-101)
  • ExecutionFailedEvent (104-112)
  • ExecutionTimeoutEvent (115-121)
backend/workers/dlq_processor.py (3)
backend/app/services/coordinator/queue_manager.py (1)
  • age_seconds (38-39)
backend/app/dlq/manager.py (1)
  • add_filter (319-320)
backend/app/db/docs/dlq.py (1)
  • Settings (34-43)
backend/tests/integration/events/test_producer_roundtrip.py (1)
backend/app/infrastructure/kafka/mappings.py (1)
  • get_topic_for_event (97-99)
backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (3)
backend/app/domain/events/typed.py (1)
  • ExecutionRequestedEvent (48-62)
backend/app/services/saga/saga_step.py (4)
  • CompensationStep (103-121)
  • SagaContext (14-70)
  • SagaStep (73-100)
  • execute (80-88)
backend/tests/helpers/events.py (1)
  • make_execution_requested_event (7-50)
backend/tests/integration/dlq/test_dlq_manager.py (6)
backend/app/dlq/manager.py (1)
  • create_dlq_manager (443-480)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
backend/app/domain/enums/kafka.py (1)
  • KafkaTopic (7-54)
backend/app/domain/events/typed.py (1)
  • DLQMessageReceivedEvent (522-532)
backend/app/settings.py (1)
  • Settings (11-169)
backend/app/events/schema/schema_registry.py (2)
  • SchemaRegistryManager (35-119)
  • deserialize_event (80-95)
backend/tests/unit/services/sse/test_kafka_redis_bridge.py (2)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • ExecutionStartedEvent (80-85)
backend/app/services/sse/redis_bus.py (1)
  • publish_event (65-71)
backend/tests/integration/events/test_schema_registry_real.py (2)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • PodCreatedEvent (135-139)
backend/app/infrastructure/kafka/mappings.py (1)
  • get_topic_for_event (97-99)
backend/tests/integration/events/test_schema_registry_roundtrip.py (2)
backend/app/infrastructure/kafka/mappings.py (1)
  • get_topic_for_event (97-99)
backend/app/events/schema/schema_registry.py (1)
  • deserialize_event (80-95)
backend/tests/unit/services/saga/test_execution_saga_steps.py (1)
backend/app/domain/events/typed.py (1)
  • ExecutionRequestedEvent (48-62)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (32)
backend/tests/integration/events/test_schema_registry_roundtrip.py (3)

5-5: LGTM!

Import correctly added to support centralized topic resolution via the mappings module.


23-24: LGTM!

Topic derivation correctly updated to use the centralized get_topic_for_event mapping. The str() conversion is appropriate since get_topic_for_event returns a KafkaTopic enum while deserialize_event expects a string parameter.


27-28: LGTM!

Good addition to exercise the idempotent initialization path of initialize_schemas(). This helps ensure coverage of the lazy initialization behavior mentioned in the comment at line 19.

backend/tests/integration/events/test_schema_registry_real.py (2)

4-6: LGTM!

Imports correctly updated to use the unified domain event model. The switch from AvroEventMetadata to EventMetadata and sourcing PodCreatedEvent from app.domain.events.typed aligns with the PR objective of consolidating to a single event system.


18-26: LGTM!

Test correctly updated to use the unified event model:

  • EventMetadata replaces AvroEventMetadata with the same required fields
  • Topic derivation via get_topic_for_event(ev.event_type) follows the new centralized mapping pattern
  • The str() conversion appropriately handles the KafkaTopic enum for the deserialize_event string parameter
backend/tests/unit/services/saga/test_execution_saga_steps.py (2)

3-3: LGTM!

Import correctly updated from the infrastructure layer to the domain layer, aligning with the unified event model migration.


112-120: LGTM!

The _FakeProducer correctly updated to use DomainEvent type hints instead of BaseEvent, consistent with the unified event model. The minimal implementation (skipping parent __init__) is appropriate for test isolation.

backend/tests/integration/events/test_producer_roundtrip.py (2)

7-7: LGTM!

Import correctly reflects the migration to the centralized topic mapping helper.


35-36: LGTM!

The topic derivation now correctly uses the centralized mapping via get_topic_for_event, and the explicit str() cast properly converts the KafkaTopic enum to the string expected by send_to_dlq.

backend/tests/unit/services/sse/test_kafka_redis_bridge.py (4)

7-7: LGTM!

Import correctly updated to use the unified domain event types from app.domain.events.typed, consistent with the PR's migration to the single event model.


19-26: LGTM!

The _FakeBus type hints correctly updated to DomainEvent, matching the production SSERedisBus.publish_event signature.


29-30: LGTM!

Helper correctly updated to use EventMetadata with the required fields. The rename from AvroEventMetadata aligns with the PR migration.


54-61: LGTM!

Good improvement: using the actual ExecutionStartedEvent type instead of a custom dummy class provides more realistic test coverage and validates the real event structure works correctly with the routing handlers.

backend/app/services/notification_service.py (4)

22-27: LGTM!

The imports correctly reflect the migration to the unified DomainEvent model. All imported types are used appropriately within the file.


330-330: Verify that blocking on delivery is intentional.

This change from fire-and-forget to awaiting _deliver_notification means create_notification will block until delivery completes. For webhook and Slack channels, this could add up to 30 seconds of latency (the configured timeout) if external services are slow.

If immediate API response is desired, consider restoring async task dispatch or applying this only to the IN_APP channel.
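
A minimal sketch of that split, keeping in-app delivery synchronous and pushing slow channels to background tasks (method and channel names follow the review's description and are assumptions about the service's API):

import asyncio
from typing import Any

class NotificationDispatchSketch:
    def __init__(self) -> None:
        # Keep references so fire-and-forget tasks are not garbage-collected mid-flight.
        self._delivery_tasks: set[asyncio.Task[None]] = set()

    async def create_notification(self, notification: Any, channel: str) -> None:
        if channel == "in_app":
            # Cheap local write: awaiting keeps the API response consistent.
            await self._deliver_notification(notification, channel)
            return
        # Webhook/Slack may take up to the 30s timeout, so dispatch in the background.
        task = asyncio.create_task(self._deliver_notification(notification, channel))
        self._delivery_tasks.add(task)
        task.add_done_callback(self._delivery_tasks.discard)

    async def _deliver_notification(self, notification: Any, channel: str) -> None:
        ...  # the real delivery logic lives in NotificationService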


631-632: LGTM!

The defensive null-check on event.resource_usage prevents AttributeError when the field is None, with a sensible fallback to 0.0.


644-656: LGTM!

The unified handler correctly uses isinstance checks to narrow the generic DomainEvent to specific typed handlers, with appropriate fallback logging for unhandled types and exception handling.

backend/tests/unit/services/saga/test_saga_step_and_base.py (3)

6-6: LGTM!

Import correctly updated to use the new domain-typed event system.


46-51: LGTM!

Test event construction correctly uses SystemErrorEvent with proper required fields (error_type, message, service_name) and embedded EventMetadata. This aligns with the unified event model.


77-78: LGTM!

Type annotations consistently updated to SystemErrorEvent across the saga step definition, execute signature, and mock spec. The changes align with the broader migration to domain-typed events.

Also applies to: 84-86, 94-94

backend/tests/unit/services/saga/test_saga_orchestrator_unit.py (4)

9-9: LGTM!

Imports correctly updated to use DomainEvent and ExecutionRequestedEvent from the domain layer, with the test helper make_execution_requested_event properly separated as a local import.

Also applies to: 19-20


48-51: LGTM!

The fake producer signature correctly accepts DomainEvent, matching the updated UnifiedProducer interface.


78-86: LGTM!

_StepOK and _Saga correctly parameterized with ExecutionRequestedEvent. The execute method signature matches the SagaStep abstract interface.

Also applies to: 98-99


122-122: LGTM!

Consistent use of make_execution_requested_event helper with execution_id override. This is cleaner than the previous approach and leverages the test helper's sensible defaults.

Also applies to: 142-142, 146-146

backend/app/services/result_processor/processor.py (3)

15-26: LGTM!

Imports correctly updated to include DomainEvent and EventMetadata from the unified domain events module.


178-192: LGTM on conditional metrics; acknowledge the brittle memory limit parsing.

The conditional check for resource_usage is a good defensive change that prevents AttributeError when resource data is unavailable.

However, line 188's parsing of K8S_POD_MEMORY_LIMIT is fragile as noted in the TODO. It assumes the format ends with "Mi", but will fail for:

  • "256Gi"int("256G")ValueError
  • "256M"int("256") but semantically wrong (megabytes vs mebibytes)
  • "268435456" (raw bytes) → int("268435456") → incorrect value

Consider extracting a utility function to parse Kubernetes resource quantities properly.
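
One possible shape for that utility, handling the common binary (Ki/Mi/Gi) and decimal (K/M/G) suffixes plus raw bytes; an illustration, not the project's parser:

# Converts a Kubernetes memory quantity ("256Mi", "1Gi", "268435456") to mebibytes.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
    "K": 1000, "M": 1000**2, "G": 1000**3,
}

def parse_memory_limit_mib(quantity: str) -> float:
    quantity = quantity.strip()
    # Check two-letter suffixes before single-letter ones.
    for suffix, factor in sorted(_SUFFIXES.items(), key=lambda kv: -len(kv[0])):
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor / 1024**2
    return float(quantity) / 1024**2  # a bare number is raw bytes

assert parse_memory_limit_mib("256Mi") == 256.0
assert parse_memory_limit_mib("1Gi") == 1024.0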


219-219: LGTM!

Reasonable fallback to "unknown" when error_type is not present, ensuring metrics recording doesn't fail.

backend/workers/dlq_processor.py (1)

68-71: Age calculation approach is correct.

The switch to explicit datetime calculation using datetime.now(timezone.utc) - message.failed_at is appropriate since DLQMessage doesn't have an age_seconds property (that exists in QueuedItem from the queue_manager, per the relevant snippets).
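
Spelled out, the calculation is simply (assuming failed_at is timezone-aware):

from datetime import datetime, timezone

def message_age_seconds(failed_at: datetime) -> float:
    """Seconds elapsed since the message landed in the DLQ."""
    return (datetime.now(timezone.utc) - failed_at).total_seconds()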

backend/tests/integration/dlq/test_dlq_manager.py (2)

84-103: Good async-first test pattern using Future instead of polling.

The approach of using asyncio.Future with asyncio.wait_for is cleaner than polling-based verification. The cleanup in the finally block properly cancels the consumer task and stops the consumer.
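
Reduced to its skeleton, the pattern looks like this (a self-contained illustration, not the project's test):

import asyncio

async def wait_for_event_pattern() -> None:
    loop = asyncio.get_running_loop()
    received: asyncio.Future[str] = loop.create_future()

    async def consumer() -> None:
        await asyncio.sleep(0.01)  # stand-in for reading from Kafka
        if not received.done():
            received.set_result("dlq_message_received")

    task = asyncio.create_task(consumer())
    try:
        event = await asyncio.wait_for(received, timeout=5.0)  # no polling loop
        assert event == "dlq_message_received"
    finally:
        task.cancel()  # mirror the test's cleanup
        await asyncio.gather(task, return_exceptions=True)

asyncio.run(wait_for_event_pattern())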


27-28: No changes needed. Beanie is properly initialized through the app's lifespan context, which runs when the app fixture is created. Although the test doesn't explicitly request the db fixture, the autouse _cleanup fixture in backend/tests/integration/conftest.py requires it, which establishes the dependency chain: _cleanup → db → scope → app_container → app (with lifespan). This ensures init_beanie is called at app startup before the test runs.

Likely an incorrect or invalid review comment.

backend/app/dlq/manager.py (2)

206-255: Retry sends to both retry and original topics - potential duplicate processing.

The _retry_message method sends the event to both a retry topic (line 222-227) and the original topic (line 230-235). This design choice means:

  1. The retry topic serves as an audit/monitoring trail
  2. The original topic gets the actual retry

This is valid if intentional, but consumers on the original topic should be idempotent since they'll receive the same event again.

Confirm this dual-publish behavior is intentional for the retry flow. The original consumers should handle idempotency to prevent duplicate processing.
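
A generic sketch of such an idempotency guard on the consumer side (an in-memory seen-set for illustration only; the real services would rely on the persistent IdempotencyManager):

from typing import Awaitable, Callable

class IdempotentHandler:
    """Skips events whose event_id has already been processed."""

    def __init__(self, handler: Callable[[object], Awaitable[None]]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # in-memory for illustration only

    async def __call__(self, event: object) -> None:
        event_id = getattr(event, "event_id", None)
        if event_id in self._seen:
            return  # duplicate delivery from the DLQ retry path: drop it
        await self._handler(event)
        if event_id is not None:
            self._seen.add(event_id)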


322-358: DLQ lifecycle event emission methods are well-structured.

The three emit methods (_emit_message_received_event, _emit_message_retried_event, _emit_message_discarded_event) follow a consistent pattern and correctly construct the domain events with proper metadata. The common _produce_dlq_event method handles serialization and publishing with error resilience.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @backend/app/db/repositories/dlq_repository.py:
- Around line 26-32: Update usages that assume a top-level event_type to
reference the nested event object: in the aggregation pipeline change the group
key from "$event_type" to "$event.event_type" so grouping uses the nested
DomainEvent field, and in query conditions replace any
DLQMessageDocument.event_type references with
DLQMessageDocument.event.event_type (e.g., in the code that builds filters or
queries) so they access the nested event.event_type property.
🧹 Nitpick comments (1)
backend/app/domain/events/typed.py (1)

16-27: Consider consistency: use_enum_values=True present here but missing in BaseEvent.

EventMetadata has use_enum_values=True in its ConfigDict, but BaseEvent (line 33) only has from_attributes=True. If enums should serialize consistently across all event types, consider adding use_enum_values=True to BaseEvent as well to ensure uniform serialization behavior.

♻️ Optional: Add use_enum_values for consistency
 class BaseEvent(AvroBase):
     """Base fields for all domain events."""

-    model_config = ConfigDict(from_attributes=True)
+    model_config = ConfigDict(from_attributes=True, use_enum_values=True)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62473fd and 05a7f8d.

📒 Files selected for processing (3)
  • backend/app/db/repositories/dlq_repository.py
  • backend/app/domain/events/typed.py
  • backend/app/services/idempotency/idempotency_manager.py
🧰 Additional context used
🧬 Code graph analysis (3)
backend/app/db/repositories/dlq_repository.py (2)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/infrastructure/kafka/mappings.py (1)
  • get_event_class_for_type (91-93)
backend/app/domain/events/typed.py (1)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
backend/app/services/idempotency/idempotency_manager.py (1)
backend/app/domain/events/typed.py (1)
  • BaseEvent (30-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (7)
backend/app/domain/events/typed.py (5)

1-6: LGTM - Clean migration to AvroBase.

The import restructuring correctly brings in AvroBase from pydantic_avro to enable Avro schema generation across all domain events.


30-40: LGTM - BaseEvent properly extends AvroBase.

The event_type: EventType field is correctly added as a required discriminator field, enabling polymorphic deserialization via the DomainEvent union.


517-553: LGTM - Well-structured DLQ event classes.

The three new DLQ event types (DLQMessageReceivedEvent, DLQMessageRetriedEvent, DLQMessageDiscardedEvent) follow the established pattern with appropriate fields and correct Literal discriminators matching the EventType enum values from app.domain.enums.events.
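
A condensed sketch of the discriminator pattern these classes rely on (the class and literal names here are illustrative, not the project's full union):

from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter

class DLQReceived(BaseModel):
    event_type: Literal["dlq_message_received"]
    event_id: str

class DLQRetried(BaseModel):
    event_type: Literal["dlq_message_retried"]
    event_id: str

# event_type acts as the discriminator for polymorphic deserialization.
Event = Annotated[Union[DLQReceived, DLQRetried], Field(discriminator="event_type")]

event = TypeAdapter(Event).validate_python(
    {"event_type": "dlq_message_retried", "event_id": "abc"}
)
assert isinstance(event, DLQRetried)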


558-574: Intentional design: ArchivedEvent duplicates base fields rather than inheriting.

ArchivedEvent extends AvroBase directly and duplicates fields from BaseEvent. This appears intentional since it's a wrapper for archived event data with additional archive-specific metadata (deleted_at, deleted_by, deletion_reason). The separate hierarchy avoids ArchivedEvent being part of the DomainEvent union.


643-649: LGTM - DomainEvent union correctly updated.

The discriminated union now includes all three new DLQ event types, maintaining the event_type discriminator for polymorphic deserialization.

backend/app/services/idempotency/idempotency_manager.py (2)

12-12: LGTM - Import aligned with unified event model.

The import path change from the infrastructure layer to app.domain.events.typed aligns with the PR's goal of consolidating to a single domain event system.


43-53: LGTM - mode="json" ensures consistent content hashing.

Using model_dump(mode="json") correctly serializes complex types (datetime, enums, UUIDs) to their JSON string representations before hashing. This prevents hash inconsistencies that could occur from serializing Python objects directly.
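
A small sketch of why mode="json" matters here (model name and fields are illustrative):

import hashlib
import json
from datetime import datetime, timezone

from pydantic import BaseModel

class SampleEvent(BaseModel):
    event_id: str
    created_at: datetime

event = SampleEvent(event_id="e-1", created_at=datetime.now(timezone.utc))

# mode="json" renders datetimes/enums/UUIDs as JSON-safe strings, so the dict
# can be dumped deterministically and hashed without type-dependent surprises.
payload = json.dumps(event.model_dump(mode="json"), sort_keys=True).encode()
content_hash = hashlib.sha256(payload).hexdigest()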


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/db/repositories/dlq_repository.py (2)

61-69: Aggregation pipeline uses incorrect field path for event_type.

The pipeline groups by "$event_type" (top-level), but according to the DLQMessageDocument schema, event_type is now nested under the event field. This should be "$event.event_type" to match the new document structure.

Proposed fix
         # Get counts by event type
         event_type_pipeline: list[Mapping[str, object]] = [
-            {"$group": {"_id": "$event_type", "count": {"$sum": 1}}},
+            {"$group": {"_id": "$event.event_type", "count": {"$sum": 1}}},
             {"$sort": {"count": -1}},
             {"$limit": 10},
         ]

107-111: Filter uses non-existent top-level event_type attribute.

DLQMessageDocument.event_type references a top-level field that no longer exists. According to the document schema, event_type is nested under event. This condition will either raise an AttributeError or silently fail to match documents.

Proposed fix
         conditions: list[Any] = [
             DLQMessageDocument.status == status if status else None,
             DLQMessageDocument.original_topic == topic if topic else None,
-            DLQMessageDocument.event_type == event_type if event_type else None,
+            {"event.event_type": str(event_type)} if event_type else None,
         ]

Note: Beanie's find() accepts raw query dicts mixed with ODM expressions. Alternatively, verify if Beanie 2.0.1 supports nested attribute access like DLQMessageDocument.event.event_type.

🧹 Nitpick comments (1)
backend/tests/integration/dlq/test_dlq_discard.py (1)

19-44: Consider extracting the shared helper to reduce duplication.

The _create_dlq_document helper is identical to the one in test_dlq_retry.py. Consider extracting it to a shared test utility (e.g., in tests/helpers/) to avoid maintaining duplicate code.

# In tests/helpers/dlq.py or similar
async def create_test_dlq_document(
    event_id: str | None = None,
    status: DLQMessageStatus = DLQMessageStatus.PENDING,
) -> DLQMessageDocument:
    ...
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05a7f8d and 7dba39f.

📒 Files selected for processing (4)
  • backend/app/db/repositories/dlq_repository.py
  • backend/tests/integration/db/repositories/test_dlq_repository.py
  • backend/tests/integration/dlq/test_dlq_discard.py
  • backend/tests/integration/dlq/test_dlq_retry.py
🧰 Additional context used
🧬 Code graph analysis (4)
backend/tests/integration/dlq/test_dlq_retry.py (2)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/db/docs/dlq.py (1)
  • DLQMessageDocument (11-43)
backend/app/db/repositories/dlq_repository.py (3)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/infrastructure/kafka/mappings.py (1)
  • get_event_class_for_type (91-93)
backend/app/db/docs/dlq.py (1)
  • DLQMessageDocument (11-43)
backend/tests/integration/dlq/test_dlq_discard.py (2)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
backend/app/db/docs/dlq.py (1)
  • DLQMessageDocument (11-43)
backend/tests/integration/db/repositories/test_dlq_repository.py (1)
backend/tests/unit/services/idempotency/test_middleware.py (1)
  • event (30-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (7)
backend/app/db/repositories/dlq_repository.py (2)

26-32: LGTM!

The _doc_to_message method correctly extracts event_type from the nested doc.event.event_type and reconstructs the event using the appropriate event class. The error message now properly references the local event_type variable.
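
In outline, that reconstruction path looks something like the sketch below (import paths and DLQMessage field names are taken from the review context and may differ from the actual code):

from app.db.docs.dlq import DLQMessageDocument  # paths assumed from this review
from app.dlq.models import DLQMessage
from app.infrastructure.kafka.mappings import get_event_class_for_type

def doc_to_message(doc: DLQMessageDocument) -> DLQMessage:
    event_type = doc.event.event_type
    event_cls = get_event_class_for_type(event_type)
    if event_cls is None:
        # The error message references the locally extracted event_type.
        raise ValueError(f"No event class registered for event type {event_type}")
    # Rebuild the typed domain event from the stored payload, then wrap it
    # in the Pydantic DLQMessage domain model.
    event = event_cls.model_validate(doc.event, from_attributes=True)
    return DLQMessage(
        event=event,
        original_topic=doc.original_topic,
        status=doc.status,
        retry_count=doc.retry_count,
        failed_at=doc.failed_at,
    )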


125-127: LGTM!

The query pattern {"event.event_id": event_id} is consistently applied across get_message_by_id, mark_message_retried, and mark_message_discarded. This aligns with the new nested event structure and the index defined on event.event_id.

Also applies to: 165-174, 176-186

backend/tests/integration/db/repositories/test_dlq_repository.py (2)

24-69: LGTM!

Test documents are correctly constructed with the nested event dictionary structure containing event_id, event_type, metadata, and event-specific fields. This matches the new DLQMessageDocument schema.


85-86: LGTM!

The assertion correctly validates the nested event structure by accessing msg.event.event_id.

backend/tests/integration/dlq/test_dlq_retry.py (2)

27-44: LGTM!

The helper correctly constructs a DLQ document by:

  1. Creating a valid event via make_execution_requested_event
  2. Dumping to dict and overriding event_id for test predictability
  3. Passing the dict to DLQMessageDocument(event=event_dict, ...)

This approach maintains type consistency while allowing controlled test data.
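
Sketched out, such a helper could read as follows (it assumes the project's make_execution_requested_event test factory and the document fields visible in these tests; the topic string is purely illustrative):

import uuid

from app.db.docs.dlq import DLQMessageDocument  # path assumed from this review
from app.dlq.models import DLQMessageStatus

async def _create_dlq_document(
    event_id: str | None = None,
    status: DLQMessageStatus = DLQMessageStatus.PENDING,
) -> DLQMessageDocument:
    # 1. Build a valid typed event via the project's test factory (assumed available here).
    event = make_execution_requested_event()
    # 2. Dump to a dict and override event_id so tests can look the document up predictably.
    event_dict = event.model_dump(mode="json")
    event_dict["event_id"] = event_id or str(uuid.uuid4())
    # 3. Store the dict as the nested `event` payload of the DLQ document.
    doc = DLQMessageDocument(
        event=event_dict,
        original_topic="execution_events",  # illustrative topic name
        status=status,
        retry_count=0,
    )
    await doc.insert()
    return doc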


62-65: LGTM!

All find_one queries consistently use {"event.event_id": event_id} to match the new document structure. The test assertions properly validate status transitions, timestamps, and reason fields.

Also applies to: 94-97, 115-117, 158-160, 181-183, 201-204, 222-225

backend/tests/integration/dlq/test_dlq_discard.py (1)

63-67: LGTM!

All find_one queries consistently use {"event.event_id": event_id}. The tests comprehensively cover discard scenarios including status transitions, timestamps, and custom reasons.

Also applies to: 99-102, 119-121, 139-141

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/db/repositories/user_settings_repository.py (1)

73-78: Delete orphaned UserSettingsDocument records in delete_user_settings.

The delete_user_settings method deletes from UserSettingsSnapshotDocument and EventDocument, but not from UserSettingsDocument—the primary settings collection where create_snapshot and get_snapshot operate. This leaves orphaned UserSettingsDocument records when a user is deleted.

Add deletion of UserSettingsDocument alongside the snapshot and event deletions:

Suggested fix
async def delete_user_settings(self, user_id: str) -> None:
    doc = await UserSettingsDocument.find_one(UserSettingsDocument.user_id == user_id)
    if doc:
        await doc.delete()
    await UserSettingsSnapshotDocument.find(UserSettingsSnapshotDocument.user_id == user_id).delete()
    await EventDocument.find(EventDocument.aggregate_id == f"user_settings_{user_id}").delete()
    self.logger.info(f"Deleted all settings data for user {user_id}")
backend/app/db/repositories/resource_allocation_repository.py (1)

23-28: The TOCTOU race is real, but the suggested fix is incomplete; use update_one() for true atomicity.

Two concurrent callers can both read the same doc in an "active" state and both update it to "released"—the check-then-act pattern is inherently racy. Adding a status filter to the query is a good step, but the suggested code still uses doc.set(), which is a separate operation after the if check and does not close the race window.

For atomic conditional updates with Beanie 2.0.1, use find_one(...filters...).update_one() instead:

Atomic fix with Beanie 2.0.1 API
from beanie.operators import Set

async def release_allocation(self, allocation_id: str) -> bool:
    now = datetime.now(timezone.utc)
    result = await ResourceAllocationDocument.find_one(
        (ResourceAllocationDocument.allocation_id == allocation_id)
        & (ResourceAllocationDocument.status == "active")
    ).update_one(
        Set({
            ResourceAllocationDocument.status: "released",
            ResourceAllocationDocument.released_at: now
        })
    )
    return result.modified_count > 0

This ensures the status filter and update happen in a single server-side operation, eliminating the race.

backend/app/db/repositories/user_repository.py (1)

29-29: Type annotation mismatch with field expressions in conditions list.

At line 29, conditions is typed as list[BaseFindOperator], but at line 41 it appends UserDocument.role == role, which is a field expression rather than a BaseFindOperator. While find() accepts both types at runtime, the annotation is too narrow: lines 34–37 append Or(...), which does conform to BaseFindOperator, so the list ends up mixing conforming operators with field expressions that violate the declared type.

Suggested fix
-        conditions: list[BaseFindOperator] = []
+        conditions: list[BaseFindOperator | Any] = []

Alternatively, use list[Any] for simplicity, or investigate whether Beanie exports a more specific union type that encompasses both field expressions and operators.

🧹 Nitpick comments (4)
backend/app/db/repositories/notification_repository.py (1)

202-213: Consider batching the subscription queries to avoid N+1 pattern.

Currently makes one database query per channel. While the number of channels is likely small, fetching all subscriptions in a single query would be more efficient.

♻️ Suggested optimization
 async def get_all_subscriptions(self, user_id: str) -> dict[NotificationChannel, DomainNotificationSubscription]:
     subs: dict[NotificationChannel, DomainNotificationSubscription] = {}
-    for channel in NotificationChannel:
-        doc = await NotificationSubscriptionDocument.find_one(
-            NotificationSubscriptionDocument.user_id == user_id,
-            NotificationSubscriptionDocument.channel == channel,
-        )
-        if doc:
-            subs[channel] = DomainNotificationSubscription.model_validate(doc, from_attributes=True)
-        else:
-            subs[channel] = DomainNotificationSubscription(user_id=user_id, channel=channel, enabled=True)
+    docs = await NotificationSubscriptionDocument.find(
+        NotificationSubscriptionDocument.user_id == user_id,
+    ).to_list()
+    existing = {doc.channel: doc for doc in docs}
+    for channel in NotificationChannel:
+        if channel in existing:
+            subs[channel] = DomainNotificationSubscription.model_validate(existing[channel], from_attributes=True)
+        else:
+            subs[channel] = DomainNotificationSubscription(user_id=user_id, channel=channel, enabled=True)
     return subs
backend/app/db/repositories/admin/admin_events_repository.py (1)

71-73: Consider using Beanie operators for consistency.

This query still uses dict syntax while the surrounding code now uses Beanie expressions. For full consistency, you could use Beanie's NE operator:

from beanie.operators import NE

related_query = EventDocument.find(
    EventDocument.metadata.correlation_id == doc.metadata.correlation_id,
    NE(EventDocument.event_id, event_id)
)

However, the current dict-based approach works correctly and this is purely a stylistic consideration.

backend/app/db/repositories/resource_allocation_repository.py (1)

9-13: Multi-arg find() syntax is correct in Beanie 2.0.1. Beanie's find() method explicitly accepts multiple filter expressions as separate positional arguments, so the code at lines 10–13 is valid and requires no changes.

The release_allocation() method (lines 23–28) follows a read-then-write pattern without an atomic check. If the document is deleted or modified between the find_one() and doc.set() calls, the update may fail or succeed silently depending on Beanie's behavior. Consider guarding the release with a status check (e.g., only release if currently "active") or using an atomic update if the business logic requires it, otherwise document that releases are intentionally idempotent.

backend/app/db/repositories/admin/admin_user_repository.py (1)

4-5: Type annotation may be inaccurate for Beanie 2.0 expression API.

In Beanie 2.0, direct field equality expressions (e.g., UserDocument.role == role) return expression objects that may not be BaseFindOperator instances. Since find() accepts Mapping | bool as well, this works at runtime but the type hint could be more accurate.

Consider using a broader type or Any for the conditions list if strict type checking is enabled.

Also applies to: 33-33

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dba39f and 84d0b9d.

📒 Files selected for processing (13)
  • backend/app/db/repositories/admin/admin_events_repository.py
  • backend/app/db/repositories/admin/admin_settings_repository.py
  • backend/app/db/repositories/admin/admin_user_repository.py
  • backend/app/db/repositories/event_repository.py
  • backend/app/db/repositories/execution_repository.py
  • backend/app/db/repositories/notification_repository.py
  • backend/app/db/repositories/replay_repository.py
  • backend/app/db/repositories/resource_allocation_repository.py
  • backend/app/db/repositories/saga_repository.py
  • backend/app/db/repositories/saved_script_repository.py
  • backend/app/db/repositories/sse_repository.py
  • backend/app/db/repositories/user_repository.py
  • backend/app/db/repositories/user_settings_repository.py
🧰 Additional context used
🧬 Code graph analysis (10)
backend/app/db/repositories/event_repository.py (2)
backend/app/db/docs/event.py (1)
  • EventDocument (13-71)
backend/app/services/coordinator/queue_manager.py (1)
  • execution_id (30-31)
backend/app/db/repositories/admin/admin_user_repository.py (2)
backend/app/domain/user/user_models.py (2)
  • User (44-57)
  • UserUpdate (60-80)
backend/app/db/repositories/user_repository.py (1)
  • update_user (53-62)
backend/app/db/repositories/saved_script_repository.py (3)
backend/app/api/routes/saved_scripts.py (1)
  • list_saved_scripts (31-38)
backend/app/services/saved_script_service.py (1)
  • list_saved_scripts (104-118)
backend/app/domain/saved_script/models.py (1)
  • DomainSavedScript (21-28)
backend/app/db/repositories/sse_repository.py (1)
backend/app/services/coordinator/queue_manager.py (1)
  • execution_id (30-31)
backend/app/db/repositories/replay_repository.py (2)
backend/app/services/event_replay/replay_service.py (1)
  • get_session (324-325)
backend/app/domain/replay/models.py (1)
  • ReplaySessionState (119-146)
backend/app/db/repositories/admin/admin_events_repository.py (1)
backend/app/db/docs/event.py (1)
  • EventDocument (13-71)
backend/app/db/repositories/notification_repository.py (1)
backend/app/domain/notification/models.py (1)
  • DomainNotification (16-45)
backend/app/db/repositories/user_repository.py (1)
backend/app/services/coordinator/queue_manager.py (1)
  • user_id (34-35)
backend/app/db/repositories/execution_repository.py (1)
backend/app/services/coordinator/queue_manager.py (1)
  • execution_id (30-31)
backend/app/db/repositories/user_settings_repository.py (2)
backend/app/services/coordinator/queue_manager.py (1)
  • user_id (34-35)
backend/app/domain/user/settings_models.py (1)
  • DomainUserSettings (34-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (37)
backend/app/db/repositories/sse_repository.py (2)

9-17: LGTM!

The migration from dict-style filter to Beanie expression-based query (ExecutionDocument.execution_id == execution_id) is correct and follows Beanie 2.x idiomatic patterns. This improves type safety and IDE support.


19-23: LGTM!

Consistent use of Beanie expression-based query matching the pattern in get_execution_status. The model_validate with from_attributes=True is the correct Pydantic v2 approach for converting ORM/document objects to domain models.

backend/app/db/repositories/replay_repository.py (3)

18-29: LGTM! Clean migration to Beanie expression syntax.

The conversion from dict-based filters to Beanie field-expression syntax (ReplaySessionDocument.session_id == session_id) is the correct approach for Beanie 2.x. This provides better type safety and IDE support while maintaining identical behavior.


48-54: LGTM!

Consistent use of Beanie expression syntax with proper null handling.


71-79: LGTM!

The doc.set() method is the correct Beanie approach for partial atomic updates, and the expression syntax change is consistent with the rest of the file.

backend/app/db/repositories/execution_repository.py (1)

29-29: LGTM! Clean migration to Beanie field expressions.

The changes consistently replace dict-based queries with Beanie's type-safe field expression syntax across get_execution, update_execution, write_terminal_result, and delete_execution. This improves type safety and IDE support while maintaining identical runtime behavior.

Also applies to: 38-38, 49-49, 84-84

backend/app/db/repositories/saved_script_repository.py (4)

11-16: LGTM!

The migration from Eq() to direct field equality expressions is the idiomatic Beanie query syntax. The find_one correctly combines both conditions with AND logic.


24-27: LGTM!

Consistent use of field equality expressions for the query filter.


35-43: LGTM!

Consistent query pattern for the delete operation.


45-47: LGTM!

The find() with field equality expression followed by to_list() is the standard Beanie pattern for retrieving multiple documents.

backend/app/db/repositories/notification_repository.py (7)

32-35: LGTM!

Clean migration to Beanie expression-based filters. The compound query ensures user-scoped access control.


44-47: LGTM!

Consistent use of Beanie expressions for the lookup query.


53-56: LGTM!

Expression syntax is correct and consistent with the repository pattern.


64-65: LGTM!

Correct usage of Beanie expressions with update_many() for bulk status updates.


70-73: LGTM!

Properly scoped deletion with expression-based filter.


171-174: LGTM!

Expression-based filter with appropriate default fallback when no subscription exists.


183-186: LGTM!

Expression-based filter correctly used in the upsert pattern.

backend/app/db/repositories/admin/admin_settings_repository.py (1)

19-19: LGTM! Clean migration to Beanie expression syntax.

The refactor from dict-based queries ({"settings_id": "global"}) to Beanie's typed field comparison (SystemSettingsDocument.settings_id == "global") is correct and consistent across all three methods. This provides better type safety and IDE support while maintaining identical behavior.

Also applies to: 43-43, 71-71

backend/app/db/repositories/admin/admin_events_repository.py (3)

65-65: LGTM! Consistent use of Beanie expression syntax for document lookups.

The refactor to typed field comparisons (EventDocument.event_id == event_id) is correct and aligns with the indexed field defined in EventDocument (see event.py line 19: event_id: Indexed(str, unique=True)).

Also applies to: 89-89


193-196: LGTM! Clean migration of replay session queries.

The conversion to Beanie expression syntax for ReplaySessionDocument lookups is correct. The surrounding logic (None checks, model validation) remains unchanged and appropriate.

Also applies to: 202-206, 209-211


254-254: LGTM!

Correctly migrated to Beanie expression syntax for ExecutionDocument lookup within the replay progress calculation.

backend/app/db/repositories/event_repository.py (3)

68-72: LGTM!

The migration from dict-based query {"event_id": event_id} to Beanie field expression EventDocument.event_id == event_id is correct and aligns with Beanie 2.x typed query patterns. This provides better type safety and IDE support.


148-152: LGTM!

The Or condition correctly uses Beanie field expressions for both execution_id and aggregate_id comparisons. This maintains the existing query semantics while adopting the typed query style.


325-330: LGTM!

Consistent migration to Beanie field expression syntax, matching the pattern used in get_event.

backend/app/db/repositories/user_settings_repository.py (2)

16-20: LGTM!

Correct migration to Beanie field expression syntax for querying user settings.


22-28: LGTM!

The upsert pattern is correctly implemented: find existing document, copy the id if found, then save() will update the existing record or insert a new one. The added logging improves observability.
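
Stripped to its essentials, the pattern looks like this sketch (the import path is assumed; only the find, copy-id, save shape matters):

from app.db.docs.user_settings import UserSettingsDocument  # path assumed for illustration

async def upsert_user_settings(user_id: str, new_doc: UserSettingsDocument) -> None:
    existing = await UserSettingsDocument.find_one(UserSettingsDocument.user_id == user_id)
    if existing:
        # Reusing the existing primary key turns save() into an update
        # instead of inserting a second document for the same user.
        new_doc.id = existing.id
    await new_doc.save()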

backend/app/db/repositories/saga_repository.py (4)

30-36: LGTM!

The upsert pattern correctly migrates to Beanie field expressions. Finding the existing document by saga_id, copying the id if found, then calling save() ensures proper update-or-insert semantics.


45-47: LGTM!

Consistent migration to Beanie field expression syntax.


80-90: LGTM!

The migration to Beanie field expression is correct. The update logic properly modifies the document fields and saves.

Note: The condition if error_message: means an empty string won't update the field. If clearing the error message is needed, consider using if error_message is not None: instead.


92-94: LGTM!

Correct migration to Beanie field expression for the ExecutionDocument query.

backend/app/db/repositories/admin/admin_user_repository.py (5)

44-45: LGTM!

Direct field equality expression is idiomatic for Beanie 2.0.


53-84: LGTM!

Field expression queries are correctly implemented and consistent with the pattern used in user_repository.py.


86-96: LGTM!

Cascade delete queries using field expressions are correctly implemented.


106-114: LGTM!

Field expression query is correctly implemented.


98-102: Both nested field query patterns are valid for Beanie 2.0.

Line 98's EventDocument.metadata.user_id correctly accesses the nested Pydantic model (EventMetadata), producing the expected MongoDB dot-notation query metadata.user_id. This is confirmed by the multiple indexes in EventDocument.Settings that reference metadata.user_id, metadata.correlation_id, and metadata.service_name.

Line 101's SagaDocument.context_data["user_id"] correctly queries the dict field using bracket notation, which is the standard Beanie pattern for dynamic dict fields. This same pattern is already used elsewhere in the codebase (saga_repository.py:19).
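
Side by side, the two query styles reduce to something like this sketch (document import paths assumed from the review):

from app.db.docs.event import EventDocument  # paths assumed from this review
from app.db.docs.saga import SagaDocument

async def events_and_sagas_for_user(user_id: str) -> tuple[list[EventDocument], list[SagaDocument]]:
    # Nested Pydantic model: attribute access compiles to the dot-notation path "metadata.user_id".
    events = await EventDocument.find(EventDocument.metadata.user_id == user_id).to_list()
    # Dynamic dict field: bracket access compiles to "context_data.user_id".
    sagas = await SagaDocument.find(SagaDocument.context_data["user_id"] == user_id).to_list()
    return events, sagas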

backend/app/db/repositories/user_repository.py (2)

4-5: LGTM!

Import cleanup correctly removes Eq since direct field equality (==) is now used instead. The retained Or and RegEx operators are still in use.


13-15: LGTM!

The migration from dictionary-based queries to Beanie field expressions (UserDocument.field == value) is idiomatic and provides better type safety and IDE refactoring support.

Also applies to: 22-24, 53-56, 64-68

@cubic-dev-ai cubic-dev-ai bot left a comment

4 issues found across 11 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/pyproject.toml">

<violation number="1" location="backend/pyproject.toml:126">
P2: Inconsistent version pinning: all other dependencies use exact versions with `==`, but this uses `>=`. This breaks reproducible builds and is inconsistent with the project's dependency management approach. Consider pinning to an exact version like `monggregate==0.22.1`.</violation>
</file>

<file name="backend/app/db/repositories/dlq_repository.py">

<violation number="1" location="backend/app/db/repositories/dlq_repository.py:74">
P1: Incorrect average age calculation. This formula computes the midpoint between oldest and newest timestamps, not the true average age across all messages. Consider computing the sum of ages in the aggregation pipeline or iterating over all timestamps to calculate the actual mean.</violation>
</file>

<file name="backend/app/db/repositories/event_repository.py">

<violation number="1" location="backend/app/db/repositories/event_repository.py:349">
P1: Using `EventDocument.aggregate_id` as a dictionary key in `$match` will not work correctly. MongoDB expects string field names as keys. The field descriptor object won't serialize to `"aggregate_id"` - it should be a string literal.</violation>

<violation number="2" location="backend/app/db/repositories/event_repository.py:350">
P1: Using `EventDocument.timestamp` as a dictionary key in `$sort` will not work correctly. MongoDB expects string field names as keys. Use a string literal instead.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
backend/tests/integration/test_dlq_routes.py (3)

97-108: Fix broken attribute references causing MyPy failures.

DLQMessageResponse no longer has event_id, event_type, or age_seconds attributes. Per the migration, access event properties via message.event.event_id and message.event.event_type. The age_seconds field was removed entirely.

🐛 Proposed fix
         # If there are messages, validate their structure
         for message in messages_response.messages:
             assert isinstance(message, DLQMessageResponse)
-            assert message.event_id is not None
-            assert message.event_type is not None
+            assert message.event.event_id is not None
+            assert message.event.event_type is not None
             assert message.original_topic is not None
             assert message.retry_count >= 0
             assert message.failed_at is not None
             assert message.status in DLQMessageStatus.__members__.values()
-
-            # Check age_seconds is reasonable
-            if message.age_seconds is not None:
-                assert message.age_seconds >= 0

148-178: Fix broken attribute references in test_get_single_dlq_message_detail.

Similar to the previous issue, DLQMessageDetail no longer has event_id or event_type at the top level. Additionally, the test extracts event_id from the raw JSON response but then asserts against the model's non-existent attribute.

🐛 Proposed fix
             # Validate detailed response
             detail_data = detail_response.json()
             message_detail = DLQMessageDetail(**detail_data)

             # Verify all fields are present
-            assert message_detail.event_id == event_id
+            assert message_detail.event.event_id == event_id
             assert message_detail.event is not None
-            assert isinstance(message_detail.event, dict)
-            assert message_detail.event_type is not None
+            assert message_detail.event.event_type is not None
             assert message_detail.original_topic is not None

Note: Also update the extraction of event_id from JSON at line 150 to use the nested path:

-            event_id = messages_data["messages"][0]["event_id"]
+            event_id = messages_data["messages"][0]["event"]["event_id"]

346-351: Fix broken attribute references in pagination test.

The set comprehensions reference msg.event_id which no longer exists.

🐛 Proposed fix
             # Messages should be different
             if page1.messages and page2.messages:
-                page1_ids = {msg.event_id for msg in page1.messages}
-                page2_ids = {msg.event_id for msg in page2.messages}
+                page1_ids = {msg.event.event_id for msg in page1.messages}
+                page2_ids = {msg.event.event_id for msg in page2.messages}
                 # Should have no overlap
                 assert len(page1_ids.intersection(page2_ids)) == 0
backend/app/db/repositories/dlq_repository.py (1)

89-94: Fix incorrect event_type field path in get_messages filter.

Line 92 uses DLQMessageDocument.event_type at the root level, but event_type is nested inside the event object. The docstring at line 12 of dlq.py explicitly states this: "Access event_id/event_type via event.event_id, event.event_type". The database index on line 39 of dlq.py also confirms the nested path: ("event.event_type", ASCENDING).

Other methods in this file access it correctly: line 49 uses DLQMessageDocument.event.event_type, and line 108 uses the nested path string "event.event_id".

The current code will silently fail to match any documents when filtering by event_type.

Proposed fix
         conditions: list[Any] = [
             DLQMessageDocument.status == status if status else None,
             DLQMessageDocument.original_topic == topic if topic else None,
-            DLQMessageDocument.event_type == event_type if event_type else None,
+            DLQMessageDocument.event.event_type == event_type if event_type else None,
         ]
backend/app/db/repositories/event_repository.py (2)

74-93: Type mismatch: event_type parameter is str but should be EventType.

The EventDocument.event_type field is typed as EventType (per the model definition), but this method accepts str. This inconsistency may cause comparison issues or require implicit conversion. Other methods like get_events_by_aggregate use EventType or list[EventType].

Suggested fix
     async def get_events_by_type(
             self,
-            event_type: str,
+            event_type: EventType,
             start_time: datetime | None = None,
             end_time: datetime | None = None,
             limit: int = 100,
             skip: int = 0,
     ) -> list[DomainEvent]:

347-362: Critical: Invalid aggregation pipeline keys - Beanie field references cannot be used as dict keys.

Lines 349-350 use EventDocument.aggregate_id and EventDocument.timestamp as dictionary keys in raw MongoDB aggregation stages. These resolve to FieldInfo objects at runtime, not strings. MongoDB expects string field names like "aggregate_id".

This will cause the aggregation to fail or produce incorrect results.

Suggested fix
     async def get_aggregate_replay_info(self, aggregate_id: str) -> EventReplayInfo | None:
         pipeline = [
-            {"$match": {EventDocument.aggregate_id: aggregate_id}},
-            {"$sort": {EventDocument.timestamp: 1}},
+            {"$match": {"aggregate_id": aggregate_id}},
+            {"$sort": {"timestamp": 1}},
             {
                 "$group": {
                     "_id": None,
                     "events": {"$push": "$$ROOT"},
                     "event_count": {"$sum": 1},
                     "event_types": {"$addToSet": S.field(EventDocument.event_type)},
                     "start_time": {"$min": S.field(EventDocument.timestamp)},
                     "end_time": {"$max": S.field(EventDocument.timestamp)},
                 }
             },
             {"$project": {"_id": 0}},
         ]
🤖 Fix all issues with AI agents
In @backend/app/db/repositories/dlq_repository.py:
- Around lines 71-75: avg_age_seconds currently holds the midpoint between the oldest and newest timestamps; replace it with the true mean by summing all failed_at timestamps and dividing by the count. Query sum_failed_at with S.sum(S.field(DLQMessageDocument.failed_at)) and compute avg_age_seconds as (now_timestamp * count - sum_failed_at) / count (equivalently total_age_seconds / count), guarding against count == 0 to avoid division by zero. Update the AgeStatistics construction in the age_stats assignment to use this computed value (or rename the field if you intend to keep the midpoint). A worked sketch of this calculation follows this prompt block.

In @backend/app/db/repositories/event_repository.py:
- Around line 18-20: The import block is unsorted; reorder the imports into
proper lexical order by module and by symbol to satisfy Ruff (I001). Ensure the
module order is beanie.odm.enums, then beanie.operators, then monggregate, and
alphabetize the names in the beanie.operators import so that GTE, In, LT, LTE,
Not, Or, RegEx are listed in alphabetical order; alternatively run `uv run ruff
check . --fix` to auto-fix.

In @backend/pyproject.toml:
- Line 126: The dependency declaration for monggregate uses a range specifier
(monggregate>=0.22.1) which is inconsistent with the repo's exact-pin policy;
update the pyproject dependency for monggregate to an exact version (e.g.,
change monggregate>=0.22.1 to monggregate==0.22.1) so it matches other entries
and ensures reproducible builds.
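
As referenced in the dlq_repository.py item above, the average-age fix reduces to simple arithmetic over the aggregation output. A standalone sketch with hypothetical variable names (the real code would take count and the summed failed_at timestamps from the pipeline result):

from datetime import datetime, timezone

def average_age_seconds(count: int, sum_failed_at_ts: float) -> float:
    """Mean age of DLQ messages, given the count and the sum of failed_at epoch timestamps."""
    if count == 0:
        return 0.0  # guard against division by zero when the DLQ is empty
    now_ts = datetime.now(timezone.utc).timestamp()
    # Sum of individual ages = sum(now - failed_at_i) = now * count - sum(failed_at_i).
    total_age_seconds = now_ts * count - sum_failed_at_ts
    return total_age_seconds / count
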
🧹 Nitpick comments (5)
backend/app/api/routes/dlq.py (2)

47-50: Consider adding from_attributes=True for consistency.

Lines 31 and 58 use model_validate(obj, from_attributes=True), but this line omits it. While it may work if msg is already a dict-like object, adding from_attributes=True would be consistent with other endpoints and more robust if the source type changes.

♻️ Suggested change for consistency
-    messages = [DLQMessageResponse.model_validate(msg) for msg in result.messages]
+    messages = [DLQMessageResponse.model_validate(msg, from_attributes=True) for msg in result.messages]

104-106: Accessing private method _discard_message breaks encapsulation.

The route directly calls dlq_manager._discard_message(), which is a private method (indicated by the underscore prefix). Consider exposing a public method on DLQManager for discarding messages, or document that this private method is intended for internal use by the route layer.
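
One way to keep the route on a public surface is a thin wrapper on the manager; a trimmed sketch (the public method name and the _discard_message signature are assumptions based on this route's usage):

class DLQManager:
    """Trimmed illustration; the real manager carries far more state and behavior."""

    async def _discard_message(self, event_id: str, reason: str) -> bool:
        # Internal implementation detail (persistence, telemetry, ...) lives here.
        raise NotImplementedError

    async def discard_message(self, event_id: str, reason: str) -> bool:
        # Public entry point for the API layer, so routes never touch the underscore-prefixed helper.
        return await self._discard_message(event_id, reason)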

backend/app/db/repositories/dlq_repository.py (1)

111-150: LGTM!

The two-stage aggregation pipeline correctly computes topic summaries with status breakdowns. The use of $arrayToObject to pivot status counts is a clean approach.

Minor observation: doc_count (line 133) and total_messages (line 129) compute the same value. You could reuse $total_messages in the divide expression on line 145 to avoid the redundant field.

backend/app/db/repositories/admin/admin_events_repository.py (2)

126-131: Inconsistent query pattern with Beanie field references as dict keys.

This query uses Beanie field objects as dict keys (EventDocument.timestamp, EventDocument.event_type), which differs from the standard string-key pattern used elsewhere in this file (e.g., line 71: {"metadata.correlation_id": doc.metadata.correlation_id}).

For consistency and reliability, consider using either:

  1. String keys: {"timestamp": {"$gte": start_time}, "event_type": {"$regex": ...}}
  2. Beanie operators: Use And(), GTE(), RegEx() operators as in _event_filter_conditions()

157-158: Type annotation is overly broad.

The type annotation list[HourlyEventCount | dict[str, Any]] is misleading since the list comprehension only produces HourlyEventCount instances. The dict[str, Any] alternative is never used.

♻️ Suggested fix
-events_by_hour: list[HourlyEventCount | dict[str, Any]] = [HourlyEventCount(**doc) for doc in hourly_result]
+events_by_hour = [HourlyEventCount(**doc) for doc in hourly_result]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84d0b9d and 80e4a2a.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • backend/app/api/routes/dlq.py
  • backend/app/db/repositories/admin/admin_events_repository.py
  • backend/app/db/repositories/dlq_repository.py
  • backend/app/db/repositories/event_repository.py
  • backend/app/db/repositories/saga_repository.py
  • backend/app/dlq/models.py
  • backend/app/domain/events/query_builders.py
  • backend/app/schemas_pydantic/dlq.py
  • backend/pyproject.toml
  • backend/tests/integration/test_dlq_routes.py
💤 Files with no reviewable changes (1)
  • backend/app/domain/events/query_builders.py
🧰 Additional context used
🧬 Code graph analysis (7)
backend/app/db/repositories/saga_repository.py (2)
backend/app/domain/events/event_models.py (1)
  • SortDirection (20-22)
backend/app/domain/saga/models.py (1)
  • Saga (10-27)
backend/app/schemas_pydantic/dlq.py (1)
backend/app/dlq/models.py (5)
  • AgeStatistics (138-145)
  • DLQMessageStatus (12-18)
  • EventTypeStatistic (129-135)
  • RetryStrategy (21-28)
  • TopicStatistic (119-126)
backend/app/db/repositories/admin/admin_events_repository.py (3)
backend/app/db/docs/event.py (1)
  • EventDocument (13-71)
backend/app/services/coordinator/queue_manager.py (2)
  • user_id (34-35)
  • execution_id (30-31)
backend/app/domain/events/event_models.py (2)
  • HourlyEventCount (114-116)
  • UserEventCount (120-122)
backend/app/api/routes/dlq.py (1)
backend/app/schemas_pydantic/dlq.py (3)
  • DLQStats (10-19)
  • DLQMessageDetail (93-113)
  • DLQTopicSummaryResponse (79-90)
backend/tests/integration/test_dlq_routes.py (1)
backend/app/dlq/models.py (4)
  • AgeStatistics (138-145)
  • DLQMessageStatus (12-18)
  • EventTypeStatistic (129-135)
  • TopicStatistic (119-126)
backend/app/db/repositories/event_repository.py (2)
backend/app/db/docs/event.py (1)
  • EventDocument (13-71)
backend/app/events/event_store.py (1)
  • get_events_by_type (123-147)
backend/app/dlq/models.py (2)
backend/app/core/utils.py (1)
  • StringEnum (6-31)
backend/app/domain/enums/events.py (1)
  • EventType (4-87)
🪛 GitHub Actions: MyPy Type Checking
backend/tests/integration/test_dlq_routes.py

[error] 99-99: "DLQMessageResponse" has no attribute "event_id"; maybe "event"? [attr-defined]


[error] 100-100: "DLQMessageResponse" has no attribute "event_type" [attr-defined]


[error] 107-107: "DLQMessageResponse" has no attribute "age_seconds" [attr-defined]


[error] 108-108: "DLQMessageResponse" has no attribute "age_seconds" [attr-defined]


[error] 160-160: "DLQMessageDetail" has no attribute "event_id"; maybe "event"? [attr-defined]


[error] 348-348: "DLQMessageResponse" has no attribute "event_id"; maybe "event"? [attr-defined]


[error] 349-349: "DLQMessageResponse" has no attribute "event_id"; maybe "event"? [attr-defined]

🪛 GitHub Actions: Ruff Linting
backend/app/db/repositories/saga_repository.py

[error] 1-1: I001 Import block is un-sorted or un-formatted. Organize imports. Fixable with '--fix'. Command: uv run ruff check . --config pyproject.toml

backend/app/db/repositories/event_repository.py

[error] 1-1: I001 Import block is un-sorted or un-formatted. Organize imports. Fixable with '--fix'. Command: uv run ruff check . --config pyproject.toml

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Frontend
  • GitHub Check: Build Backend
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (33)
backend/app/schemas_pydantic/dlq.py (4)

6-7: LGTM! Clean migration to typed models and DomainEvent.

The imports correctly pull in the typed statistics models from app.dlq and DomainEvent from the domain layer, aligning with the unified event model architecture.


15-18: Type safety improvement for DLQStats.

Using dict[DLQMessageStatus, int], list[TopicStatistic], list[EventTypeStatistic], and AgeStatistics instead of generic dict[str, int] and list[dict[str, Any]] provides better type checking and IDE support.


22-37: DLQMessageResponse correctly mirrors DLQMessage with nested DomainEvent.

The model now embeds event: DomainEvent as the central payload container. Consumers should access message.event.event_id and message.event.event_type instead of the previous top-level fields, as noted in the migration guide.


93-113: DLQMessageDetail aligned with DLQMessage structure.

Consistent with DLQMessageResponse, this detailed model now uses the embedded event: DomainEvent pattern with producer_id explicitly exposed.

backend/app/dlq/models.py (3)

31-52: Well-structured migration from dataclass to Pydantic BaseModel.

The DLQMessage model correctly:

  • Uses from_attributes=True for ORM compatibility
  • Embeds DomainEvent as the unified event payload
  • Provides sensible defaults for error handling scenarios
  • Documents the access pattern change in the docstring

The Field(default_factory=...) pattern for mutable defaults (datetime, dict) is the correct Pydantic approach.
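
For context, the default_factory pattern in isolation (illustrative field names only):

from datetime import datetime, timezone

from pydantic import BaseModel, Field

class DLQEntry(BaseModel):
    # default_factory is evaluated once per instance, so every entry gets its own
    # timestamp and dict instead of sharing a single mutable default.
    failed_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    details: dict[str, str] = Field(default_factory=dict)

a = DLQEntry()
b = DLQEntry()
assert a.details is not b.details  # separate dicts, not a shared default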


118-157: Statistics models converted to Pydantic BaseModel with from_attributes.

TopicStatistic, EventTypeStatistic, AgeStatistics, and DLQStatistics are now Pydantic models enabling model_validate from ORM/domain objects. This aligns with the route layer changes that use model_validate(obj, from_attributes=True).


189-200: DLQTopicSummary converted to Pydantic BaseModel.

Consistent with other statistics models, enabling direct validation from repository results.

backend/tests/integration/test_dlq_routes.py (1)

50-76: LGTM! Statistics assertions correctly updated to typed models.

The test properly validates:

  • AgeStatistics instance for age stats
  • Iteration over DLQMessageStatus enum values
  • TopicStatistic and EventTypeStatistic typed instances
backend/app/api/routes/dlq.py (2)

28-31: LGTM! Correct use of model_validate with from_attributes.

The DLQStats.model_validate(stats, from_attributes=True) pattern correctly converts the domain statistics object to the Pydantic response model.


109-112: LGTM! Updated to modern Python type hints.

Using list[DLQTopicSummaryResponse] (lowercase) instead of List[DLQTopicSummaryResponse] aligns with Python 3.9+ conventions.

backend/app/db/repositories/dlq_repository.py (4)

1-20: LGTM!

Imports are correctly updated to use monggregate's Pipeline and S helpers, and unused typing imports have been cleaned up.


100-105: LGTM!

The use of model_validate(d, from_attributes=True) is the correct Pydantic v2 pattern for converting Beanie documents to domain models.


107-109: LGTM!

The nested path query for event.event_id is correct for the new data model structure.


152-173: LGTM!

Both methods correctly use the nested event.event_id path for lookups, consistent with the migrated data model. The find-then-save pattern is appropriate for DLQ operations where concurrent updates to the same message are unlikely.
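
In shortened form, the find-then-save update described here might read as below (the DISCARDED status member and field names such as discarded_at and discard_reason are assumptions inferred from the surrounding review; error handling is omitted):

from datetime import datetime, timezone

from app.db.docs.dlq import DLQMessageDocument  # path assumed from this review
from app.dlq.models import DLQMessageStatus

async def mark_message_discarded(event_id: str, reason: str) -> bool:
    doc = await DLQMessageDocument.find_one({"event.event_id": event_id})
    if doc is None:
        return False
    # Mutate the loaded document and persist it with a single save(); acceptable here
    # because concurrent updates to one DLQ message are unlikely.
    doc.status = DLQMessageStatus.DISCARDED        # status member name assumed
    doc.discarded_at = datetime.now(timezone.utc)  # field name assumed
    doc.discard_reason = reason                    # field name assumed
    await doc.save()
    return True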

backend/app/db/repositories/event_repository.py (5)

68-72: LGTM!

The migration to Beanie field-based query syntax is correct.


241-259: LGTM!

The cleanup_old_events method correctly uses Beanie operators for building query conditions.


316-328: LGTM!

The list_event_types method correctly uses S.field() within aggregation pipeline values.


330-342: LGTM!

The delete_event_with_archival method correctly uses Beanie field-based query syntax.


181-234: The usage of S.field() with Beanie field references is correct. Integration tests confirm the aggregation pipeline executes successfully and produces valid EventStatistics objects at runtime with monggregate 0.22.1. The pattern is used consistently across multiple repositories (saga_repository.py, dlq_repository.py, admin_events_repository.py) and correctly handles both simple fields (e.g., event_type) and nested fields (e.g., metadata.service_name).

backend/app/db/repositories/admin/admin_events_repository.py (6)

6-6: LGTM!

The monggregate import for Pipeline and S is appropriate for the new aggregation pipeline implementations.


65-65: LGTM!

Correct use of Beanie's field-based expression syntax for find_one(). This is more readable and type-safe than raw dict queries.


134-142: Same field reference concern applies here.

The by=EventDocument.event_type parameter passes a Beanie field object to monggregate's group(). Covered by the verification request above.


169-169: Good defensive filtering for null user_ids.

The if doc["user_id"] filter correctly excludes documents where user_id is None or empty, preventing invalid UserEventCount instances.


232-232: LGTM!

Consistent use of Beanie field-based expressions for find_one() lookups across ReplaySessionDocument and ExecutionDocument queries (lines 232, 241, 248, 293).


98-119: This pattern is correct and follows established codebase conventions.

The use of Beanie field references (e.g., EventDocument.timestamp) as dict keys in monggregate Pipeline is not non-standard—it's the standard pattern used consistently across the entire codebase. The same pattern appears in saga_repository.py, dlq_repository.py, and multiple other repositories with SagaDocument, DLQMessageDocument, and ExecutionDocument.

Beanie field references are designed to serialize correctly when used with monggregate Pipeline operations like .match(), .group(), and S.field(). This integration works reliably throughout the codebase. The alternative of using string field names would be inconsistent with established conventions and is not used anywhere in the repository.

No changes are needed.

backend/app/db/repositories/saga_repository.py (8)

30-36: LGTM!

The migration to field-equality query syntax (SagaDocument.saga_id == saga.saga_id) aligns with Beanie 2.x patterns. The upsert logic correctly preserves the document id for updates.


45-47: LGTM!

Clean field-equality query pattern.


49-66: LGTM!

Formatting adjustments with unchanged semantics.


80-90: LGTM!

Field-equality query migration looks correct. The conditional if error_message: appropriately only updates the error when provided.


92-94: LGTM!

Field-equality query on ExecutionDocument is correct.


103-118: LGTM!

Signature formatting adjusted; logic unchanged.


137-145: S.field() with Beanie field descriptors is the established pattern across the codebase.

The code correctly uses S.field(SagaDocument.completed_at) and S.field(SagaDocument.created_at). This same pattern is used consistently throughout dlq_repository.py, event_repository.py, and admin_events_repository.py with various document fields and nested fields. No changes needed.

Likely an incorrect or invalid review comment.


96-101: Use string field names instead of Beanie field descriptors with Pipeline.group().

The by parameter should receive a string field name (e.g., by="state" or by="$state"), not a Beanie field descriptor like SagaDocument.state. Monggregate is driver-agnostic and does not have special integration for Beanie field descriptors; it expects standard MongoDB aggregation pipeline inputs. The correct pattern is:

pipeline = Pipeline().group(by="state", query={"count": S.sum(1)})

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
backend/tests/integration/test_dlq_routes.py (3)

144-148: Inconsistent JSON access pattern will cause test failure.

Line 146 accesses the raw JSON dict using ["event_id"], but per the new domain event structure, event_id is nested under event. This is inconsistent with the updated Pydantic model access elsewhere (e.g., lines 99-100, 157-158).

🐛 Proposed fix
-            event_id = messages_data["messages"][0]["event_id"]
+            event_id = messages_data["messages"][0]["event"]["event_id"]

220-222: Inconsistent JSON access pattern will cause test failure.

Same issue as line 146: raw JSON dict access uses old ["event_id"] pattern instead of ["event"]["event_id"].

🐛 Proposed fix
-            event_ids = [msg["event_id"] for msg in messages_data["messages"][:2]]
+            event_ids = [msg["event"]["event_id"] for msg in messages_data["messages"][:2]]

256-258: Inconsistent JSON access pattern will cause test failure.

Same issue: raw JSON dict access uses old ["event_id"] pattern instead of ["event"]["event_id"].

🐛 Proposed fix
-            event_id = messages_data["messages"][0]["event_id"]
+            event_id = messages_data["messages"][0]["event"]["event_id"]
backend/app/db/repositories/event_repository.py (1)

348-363: Bug: Invalid field references in $match and $sort stages.

Lines 350-351 use EventDocument.aggregate_id and EventDocument.timestamp directly as dictionary keys. In Python, this evaluates to Beanie field descriptor objects, not string field names. MongoDB aggregation expects string keys in $match and $sort stages, so this pipeline will fail to match any documents.

The S.field() calls on lines 357-359 are correct because those aggregation operators ($addToSet, $min, $max) expect field reference strings like "$event_type".

🐛 Proposed fix
     async def get_aggregate_replay_info(self, aggregate_id: str) -> EventReplayInfo | None:
         pipeline = [
-            {"$match": {EventDocument.aggregate_id: aggregate_id}},
-            {"$sort": {EventDocument.timestamp: 1}},
+            {"$match": {"aggregate_id": aggregate_id}},
+            {"$sort": {"timestamp": 1}},
             {
                 "$group": {
                     "_id": None,
                     "events": {"$push": "$$ROOT"},
                     "event_count": {"$sum": 1},
                     "event_types": {"$addToSet": S.field(EventDocument.event_type)},
                     "start_time": {"$min": S.field(EventDocument.timestamp)},
                     "end_time": {"$max": S.field(EventDocument.timestamp)},
                 }
             },
             {"$project": {"_id": 0}},
         ]
🤖 Fix all issues with AI agents
In @backend/app/db/repositories/saga_repository.py:
- Around line 138-147: S.field() is currently being called with Beanie document
attributes; change calls like S.field(SagaDocument.completed_at) and
S.field(SagaDocument.created_at) to use string field names ("completed_at",
"created_at") inside the duration_pipeline projection, and apply the same fix
wherever S.field(...) is used in dlq_repository.py, event_repository.py, and
admin_events_repository.py so all aggregates pass string field names; keep the
rest of the aggregation (group and milliseconds-to-seconds conversion) unchanged
and ensure references to SagaDocument.find(...) aggregate remain the same.
- Around line 97-102: The aggregation group calls are using raw Beanie
FieldDescriptor objects (e.g., SagaDocument.state, EventDocument.event_type,
EventDocument.metadata.user_id, DLQMessageDocument.status,
DLQMessageDocument.event.event_type) which prevents monggregate from resolving
field paths; update each Pipeline.group(by=...) to wrap the field descriptors
with S.field(...) (for example change by=SagaDocument.state to
by=S.field(SagaDocument.state)) in the methods count_sagas_by_state and the
other group usages noted so the group key is a proper field path.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80e4a2a and 87516a8.

📒 Files selected for processing (3)
  • backend/app/db/repositories/event_repository.py
  • backend/app/db/repositories/saga_repository.py
  • backend/tests/integration/test_dlq_routes.py
🧰 Additional context used
🧬 Code graph analysis (3)
backend/app/db/repositories/event_repository.py (1)
backend/app/db/docs/event.py (1)
  • EventDocument (13-71)
backend/app/db/repositories/saga_repository.py (1)
backend/app/domain/saga/models.py (1)
  • Saga (10-27)
backend/tests/integration/test_dlq_routes.py (1)
backend/app/dlq/models.py (4)
  • AgeStatistics (138-145)
  • DLQMessageStatus (12-18)
  • EventTypeStatistic (129-135)
  • TopicStatistic (119-126)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (20)
backend/tests/integration/test_dlq_routes.py (6)

5-5: LGTM!

Import updated correctly to use the new typed models from app.dlq.


50-76: LGTM!

Tests correctly updated to validate the new typed statistic models (AgeStatistics, TopicStatistic, EventTypeStatistic) and iterate over DLQMessageStatus enum values.


99-104: LGTM!

Access pattern correctly updated to use message.event.event_id and message.event.event_type for Pydantic model access.


155-158: LGTM!

Correctly updated to access event fields via message_detail.event.event_id and message_detail.event.event_type.


188-190: LGTM!

Minor formatting adjustment, no functional change.


343-344: LGTM!

Pagination test correctly updated to use msg.event.event_id for Pydantic model access.

backend/app/db/repositories/event_repository.py (7)

7-7: LGTM!

The S helper import from monggregate is correctly added to support type-safe field references in aggregation pipelines.


69-73: LGTM!

The refactor from dict-based query to Beanie field comparison (EventDocument.event_id == event_id) is correct and provides better type safety.


146-167: LGTM!

The get_execution_events method correctly uses Beanie field comparisons within the Or operator for matching either execution_id or aggregate_id.


182-235: LGTM!

The aggregation pipeline correctly uses S.field() for type-safe field references in grouping and date operations. The faceted aggregation for statistics by type, service, and hour is well-structured.


317-329: LGTM!

The list_event_types method correctly uses S.field() for the group stage, consistent with other aggregation methods in this file.


331-343: LGTM!

The refactor to use Beanie field comparison for finding the event is correct and consistent with other methods.


242-260: LGTM!

The cleanup_old_events method correctly uses Beanie operators (LT, In) with field references for the time-based cleanup query.

backend/app/db/repositories/saga_repository.py (7)

7-7: LGTM!

The monggregate import aligns with the migration to Pipeline-based aggregations across the repository.


31-37: LGTM!

The migration to field equality expressions and the upsert pattern (assigning existing.id to preserve the document identity) is correct for Beanie.


46-48: LGTM!

Clean migration to field equality expression.


50-52: LGTM!

Formatting adjustment with no semantic change.


81-91: LGTM!

Field equality expression is consistent with the other query migrations in this file.


93-95: LGTM!

Consistent migration to field equality for ExecutionDocument.


104-109: LGTM!

Formatting adjustment with no semantic change.

Comment on lines 97 to 102
     async def count_sagas_by_state(self) -> dict[str, int]:
-        pipeline = [{"$group": {"_id": "$state", "count": {"$sum": 1}}}]
+        pipeline = Pipeline().group(by=SagaDocument.state, query={"count": S.sum(1)})
         result = {}
-        async for doc in SagaDocument.aggregate(pipeline):
+        async for doc in SagaDocument.aggregate(pipeline.export()):
             result[doc["_id"]] = doc["count"]
         return result

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

monggregate Pipeline group by parameter documentation

💡 Result:

You likely mean MongoDB's aggregation $group stage (used to "group by" in pipelines). Key points:

  • Syntax: { $group: { _id: <expression>, <field1>: { <accumulator1>: <expression1> }, ... } } — _id is the group key (use null to aggregate all docs). [1]
  • Group key: any valid expression (field path like "$field", compound object, or computed expression). [1]
  • Accumulators: common operators include $sum, $avg, $min, $max, $first, $last, $push, $addToSet, $count, and newer ones like $top/$bottom/$firstN/$lastN. [1]
  • Memory/disk: $group uses up to 100 MB RAM by default; set allowDiskUse: true on db.collection.aggregate(...) to spill to disk for large groups. [1]
  • Performance: $group is a blocking stage; sorting+grouping patterns can be optimized with indexes when using $first/$last. [1]
  • Examples: grouping by a field: db.collection.aggregate([{ $group: { _id: "$item", total: { $sum: "$amount" } } }]). [1]

For full reference and complete list of accumulators, examples, and behavior, see the official MongoDB docs: [1] and an alternate manual copy: [2].

Sources
[1] MongoDB Manual — $group (aggregation stage).
[2] $group (aggregation) — MongoDB Manual (mirror).


🏁 Script executed:

# Find SagaDocument definition and understand its structure
find . -name "*.py" -type f | xargs grep -l "class SagaDocument" | head -5

Repository: HardMax71/Integr8sCode

Length of output: 95


🏁 Script executed:

# Look for other uses of Pipeline().group() in the codebase to see patterns
rg "Pipeline\(\)\.group" --type py -A 2 -B 1

Repository: HardMax71/Integr8sCode

Length of output: 1722


🏁 Script executed:

# Read the SagaDocument definition
cat -n ./backend/app/db/docs/saga.py | head -50

Repository: HardMax71/Integr8sCode

Length of output: 1905


🌐 Web query:

monggregate Python library documentation group method

💡 Result:

The monggregate Pipeline.group(...) method builds a $group stage; call signature (examples from the docs):

pipeline.group(
    by="year",
    query={
        "movie_count": {"$sum": 1},
        "movie_titles": {"$push": "$title"}
    }
)

Notes:

  • monggregate accepts a "by" argument instead of MongoDB's _id (convenience alias).
  • The "query" argument contains accumulator expressions (e.g., $sum, $push) as in standard MongoDB $group usage. (vianneymi.github.io)


🏁 Script executed:

# Check if there's a requirements.txt or pyproject.toml to understand monggregate version
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "poetry.lock" | head -5

Repository: HardMax71/Integr8sCode

Length of output: 90


🏁 Script executed:

# Search for any monggregate imports or usage examples in the codebase
rg "from monggregate" --type py -A 2 | head -30

Repository: HardMax71/Integr8sCode

Length of output: 988


🏁 Script executed:

# Read pyproject.toml to see monggregate version
cat ./backend/pyproject.toml | grep -A 5 -B 5 monggregate

Repository: HardMax71/Integr8sCode

Length of output: 284


🏁 Script executed:

# Search for test files related to saga_repository
find . -path "*/test*" -name "*saga*" -type f | head -10

Repository: HardMax71/Integr8sCode

Length of output: 430


🏁 Script executed:

# Look for any tests of aggregation pipelines
rg "count_sagas_by_state" --type py

Repository: HardMax71/Integr8sCode

Length of output: 172


🌐 Web query:

monggregate Beanie field reference support MongoDB aggregation

💡 Result:

Short answer: Yes — Beanie lets you run MongoDB aggregation pipelines (including stages that reference other collections such as $lookup/$graphLookup), but Beanie does not automatically "follow" Beanie Link/Reference fields for you inside an aggregation; you must express the join/field-reference with standard aggregation stages or project fields yourself. Use Document.aggregate(...) (or .find(...).aggregate(...)) with a pipeline and, if you want typed outputs, pass a projection_model. See examples and notes below.

Key refs:

  • Beanie aggregation usage and projection_model: Beanie docs (Aggregation). [1]
  • Beanie relations (Link fields are supported for top‑level fields, but fetching/resolving is an explicit operation — not automatic in arbitrary aggregation pipelines). [2]
  • MongoDB aggregation reference (use $lookup/$graphLookup and standard aggregation expressions to reference fields/other collections). [3]

If you want, tell me your Beanie Document models and the aggregation you want to run and I’ll draft the exact pipeline + projection_model code.


🌐 Web query:

monggregate GitHub source code Pipeline group method implementation

💡 Result:

The Pipeline.group implementation is in the monggregate GitHub repo at src/monggregate/pipeline.py — see the repository and documentation for the group method and examples. [1][2]

Sources:

  1. VianneyMI/monggregate — GitHub repository (root; file path: src/monggregate/pipeline.py). https://github.com/VianneyMI/monggregate
  2. Monggregate documentation (pipeline/group usage examples). https://vianneymi.github.io/monggregate/

If you want, I can fetch and paste the exact group() method source from src/monggregate/pipeline.py.


🏁 Script executed:

# Check integration tests to see if count_sagas_by_state is actually invoked
grep -r "count_sagas_by_state" ./backend/tests/ --type py

Repository: HardMax71/Integr8sCode

Length of output: 183


🏁 Script executed:

# Look for any admin routes or tests that call aggregation methods
rg "count_sagas" --type py -B 2 -A 5

Repository: HardMax71/Integr8sCode

Length of output: 794


🏁 Script executed:

# Check what type SagaDocument.state actually is at runtime
# by looking at how it's used in other parts of the code
rg "SagaDocument\.state" --type py | head -20

Repository: HardMax71/Integr8sCode

Length of output: 772


🌐 Web query:

site:github.com/VianneyMI/monggregate group method by parameter type handling

💡 Result:

I can look that up in the repository and show how the group(...) method handles different parameter types. Do you want me to search the monggregate code on GitHub now and return the exact implementation and examples?


🌐 Web query:

monggregate "by=" parameter field descriptor Python object

💡 Result:

The "by=" parameter in monggregate is the field/field-expression descriptor used to specify the field(s) (or expression) to operate on for stages like group() and sort(). In practice you pass:

  • a string field name (e.g. by="year"),
  • an _id/compound/group expression (for group by) or
  • an expression built with monggregate's S helpers (S.*) when you need computed keys.

Examples in the official docs show sort(by="year") and group(by="year", query={...}) and using S expressions for more complex keys. [1][2]

Sources:

  • monggregate usage examples (sort/group). [1]
  • monggregate docs / tutorial pages (expressions & S helper usage). [2]

🏁 Script executed:

# Check if there are any exception handlers or error logs around aggregation calls
rg "aggregate\(" ./backend/app/db/repositories/ --type py -A 3 | head -40

Repository: HardMax71/Integr8sCode

Length of output: 3659


🏁 Script executed:

# Check imports to see what S is and understand monggregate API better
rg "from monggregate import" ./backend/app/db/repositories/ --type py

Repository: HardMax71/Integr8sCode

Length of output: 409


🏁 Script executed:

# Look at pyproject.toml more carefully for any monggregate-related config
cat ./backend/pyproject.toml | head -100

Repository: HardMax71/Integr8sCode

Length of output: 3029


🏁 Script executed:

# Search for any code that calls count_sagas_by_state
rg "count_sagas_by_state" . --type py

Repository: HardMax71/Integr8sCode

Length of output: 174


🏁 Script executed:

# Check if there are any admin or debug endpoints that might call aggregation methods
rg "get_saga_stats\|count_sagas_by_state\|state_pipeline" ./backend --type py -B 2 -A 5

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Look for Beanie FieldInfo type and its __str__ method
rg "class.*FieldInfo" --type py | head -10

Repository: HardMax71/Integr8sCode

Length of output: 48


🌐 Web query:

Beanie FieldInfo __str__ conversion to field name string

💡 Result:

Use the model's model_fields keys (or the attribute name you defined) instead of relying on FieldInfo.__str__. FieldInfo is metadata for a field; it does not hold the attribute name as its string representation. Read its attributes (alias / serialization_alias) or inspect the model_fields mapping to get the field name.

Example (Beanie / Pydantic v2 style):

  • preferred: get names from the model_fields dict
    for name, info in MyDoc.model_fields.items():
        print(name)        # actual attribute/field name
        print(info.alias)  # alias if set (may be _Unset/None)

  • fallback: use serialization alias when present
    field_name = info.serialization_alias or info.alias or name

Reference: Pydantic FieldInfo documentation (attributes like alias / serialization_alias; FieldInfo is field metadata). [1]

Sources
[1] Pydantic — Fields / FieldInfo.


🏁 Script executed:

# Check actual test files to see if they mock or test aggregations
cat -n ./backend/tests/integration/test_saga_routes.py | head -80

Repository: HardMax71/Integr8sCode

Length of output: 3816


🏁 Script executed:

# Search for calls to get_saga_stats
rg "get_saga_stats" . --type py

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Look for admin routes or services that call repository methods
rg "saga_repository\|SagaRepository" ./backend/app --type py -B 2 -A 3 | head -50

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Check if there are any admin endpoints that trigger aggregations
fd "admin.*route\|route.*admin" --type f ./backend/app --extension py

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Search for any actual usage of aggregation pipelines in service/route layers
rg "\.aggregate\(" ./backend/app --type py -B 3 | head -60

Repository: HardMax71/Integr8sCode

Length of output: 4748


🏁 Script executed:

# Check what methods call get_saga_stats or use aggregations in services/routes
rg "get_saga_stats\|\.aggregate\(" ./backend/app/db --type py -l

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Look at admin routes specifically
find ./backend/app -name "*admin*" -type f -name "*.py" | xargs grep -l "stats\|aggregate" 2>/dev/null

Repository: HardMax71/Integr8sCode

Length of output: 333


🏁 Script executed:

# Read the complete get_saga_stats method to understand its context
rg "async def get_saga_stats" -A 30 ./backend/app/db/repositories/saga_repository.py

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Search for calls to get_saga_stats in services or routes
rg "get_saga_stats" ./backend --type py -r

Repository: HardMax71/Integr8sCode

Length of output: 479


🌐 Web query:

monggregate Pipeline group by= Beanie field descriptor object support

💡 Result:

Short answer: Yes — Beanie’s built‑in aggregate helpers (avg, sum, count, etc.) accept Beanie FieldDescriptor properties (e.g. Product.price). For raw aggregation pipelines you must still use MongoDB field paths (strings like "$field" or "$nested.field") in the pipeline; Beanie’s .aggregate() accepts a native pipeline (dicts) and a projection_model. Example patterns:

  • Using helper (accepts FieldDescriptor):

    • await Product.avg(Product.price). [1][2]
  • Using raw pipeline (use "$..." field paths):

    • await Product.find(...).aggregate([{"$group": {"_id": "$category.name", "total": {"$avg": "$price"}}}], projection_model=OutModel).to_list() . [1][2]

Sources:
[1] Beanie docs — Aggregations (examples showing avg(Product.price) and native pipeline usage).
[2] Beanie PyPI / tutorial pages with same examples.
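
As a concrete hedged illustration of the raw-pipeline pattern above, here is a sketch using Beanie's Document.aggregate with a typed projection_model; the Product document and its fields are hypothetical, not taken from this PR:

from beanie import Document
from pydantic import BaseModel, Field


class Product(Document):
    category: str
    price: float


class CategoryAvg(BaseModel):
    # $group writes the grouping key into _id; expose it under a friendlier name
    category: str | None = Field(default=None, alias="_id")
    avg_price: float


async def avg_price_per_category() -> list[CategoryAvg]:
    # Raw pipelines use "$field" string paths; projection_model types each row
    return await Product.aggregate(
        [{"$group": {"_id": "$category", "avg_price": {"$avg": "$price"}}}],
        projection_model=CategoryAvg,
    ).to_list()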


🏁 Script executed:

# Search more broadly for calls to saga stats or repository in services
rg "saga_repository\|SagaRepository" ./backend/app/services --type py -B 2 -A 5

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Check admin events service to see the pattern
cat -n ./backend/app/services/admin/admin_events_service.py | head -100

Repository: HardMax71/Integr8sCode

Length of output: 3877


🏁 Script executed:

# Look at the complete saga_repository to find get_saga_stats signature
cat -n ./backend/app/db/repositories/saga_repository.py | grep -A 40 "get_saga_stats"

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Test what monggregate exports when using a FieldDescriptor
python3 << 'EOF'
# Simulate the issue
class MockFieldDescriptor:
    def __init__(self, name):
        self.name = name
    
    def __str__(self):
        return self.name
    
    def __repr__(self):
        return f"MockFieldDescriptor({self.name})"

# If monggregate just converts to string, it would produce "state" not "$state"
field = MockFieldDescriptor("state")
print(f"String conversion: {str(field)}")
print(f"Repr: {repr(field)}")
# MongoDB expects "$state" for field reference in aggregation
EOF

Repository: HardMax71/Integr8sCode

Length of output: 152


🏁 Script executed:

# Search for admin routes
find ./backend/app -path "*/routes/*admin*" -o -path "*/admin/*route*" | head -10

Repository: HardMax71/Integr8sCode

Length of output: 263


🏁 Script executed:

# Look for any routes that reference saga stats
rg "saga.*stats\|stats.*saga" ./backend/app --type py -i

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Check the admin_events_repository to see the pattern
cat -n ./backend/app/db/repositories/admin/admin_events_repository.py | head -120

Repository: HardMax71/Integr8sCode

Length of output: 5623


🏁 Script executed:

# Directly check what happens with the field descriptor - look for actual test or usage
rg "SagaDocument\.state\|DLQMessageDocument\.status" ./backend/app --type py -B 1 -A 1 | head -50

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Continue reading admin_events_repository to see more pattern examples
cat -n ./backend/app/db/repositories/admin/admin_events_repository.py | sed -n '95,180p'

Repository: HardMax71/Integr8sCode

Length of output: 4540


🏁 Script executed:

# Search for other uses of group(by=...) to see if there's a consistent pattern
rg "\.group\(by=" ./backend --type py -A 1 -B 1

Repository: HardMax71/Integr8sCode

Length of output: 2647


🏁 Script executed:

# Read more of saga_repository to see all group calls
cat -n ./backend/app/db/repositories/saga_repository.py | sed -n '90,180p'

Repository: HardMax71/Integr8sCode

Length of output: 2977


🏁 Script executed:

# Search for any Beanie/monggregate integration documentation or test
find ./backend/tests -name "*.py" -type f -exec grep -l "Pipeline\|group\|aggregate" {} \; | head -5

Repository: HardMax71/Integr8sCode

Length of output: 1221


🏁 Script executed:

# Look at one of the test files to see if aggregations are tested
cat -n ./backend/tests/integration/services/saga/test_saga_service.py | head -100

Repository: HardMax71/Integr8sCode

Length of output: 1060


🌐 Web query:

monggregate source code Beanie FieldDescriptor support how convert field reference

💡 Result:

Short answer + code: detect Beanie field/descriptor and extract its actual field name (alias) before building the mongo field reference.

Example helper to add to monggregate source (Python):

from typing import Any
from beanie.odm.fields import ExpressionField # optional import for exact type-check
from pydantic import BaseModel
from pydantic.fields import ModelField, FieldInfo

def to_field_ref(field: Any) -> str:
    """
    Convert possible inputs (string, Beanie expression/field descriptor,
    Pydantic FieldInfo/ModelField, dataclass attr, etc.) to a Mongo field
    reference string (e.g. "a.b" or "$a.b").
    """
    # already a string -> return as-is (no leading $)
    if isinstance(field, str):
        return field

    # Beanie ExpressionField (and similar descriptors) implement attr access
    # and behave like strings — try str() first
    try:
        s = str(field)
    except Exception:
        s = None
    if s:
        # ExpressionField.__repr__/__str__ yields the path (e.g. "comments.count")
        # so prefer that when available
        return s

    # Pydantic ModelField or FieldInfo -> use name or alias
    if isinstance(field, ModelField):
        return field.alias or field.name
    if isinstance(field, FieldInfo):
        # FieldInfo has no 'name' — caller should provide name; fallback to repr
        return getattr(field, "alias", repr(field))

    # generic object with common attributes
    for attr in ("alias", "name", "_name", "field_name"):
        if hasattr(field, attr):
            val = getattr(field, attr)
            if isinstance(val, str):
                return val

    # last resort: attribute access for simple descriptor objects
    if hasattr(field, "__getattr__"):
        try:
            return str(field)
        except Exception:
            pass

    raise TypeError(f"Unsupported field descriptor: {type(field)!r}")

Usage when building the pipeline:

  • call to_field_ref(x) to get "a.b", then use f"${to_field_ref(x)}" wherever monggregate expects a $-prefixed field path.

Notes / rationale:

  • monggregate expects plain field paths when composing expressions; convert Beanie ExpressionField / descriptors to their string path first. See monggregate docs for how expressions use Expression.field(...) and string paths. [1]
  • Beanie field types and expression fields are in Beanie's fields API (ExpressionField supports attribute access / indexing and is string-like). Use its str()/repr() or .name/.alias when available. [2]

References:
[1] Monggregate docs (usage / expressions). (vianneymi.github.io)
[2] Beanie fields / ExpressionField docs. (beanie-odm.dev)

If you want, I can make a PR patch for monggregate showing exactly where to call this helper and add unit tests — tell me the repo/branch.


🏁 Script executed:

# Check if there's any monggregate version history or changelog that addresses this
find ./backend -name "*.lock" -o -name "*.txt" | xargs grep -i "monggregate" 2>/dev/null | head -20

Repository: HardMax71/Integr8sCode

Length of output: 760


🏁 Script executed:

# Final check: Look for S.field() usage patterns to confirm best practice
rg "S\.field\(" ./backend/app/db/repositories --type py -B 1 -A 1 | head -40

Repository: HardMax71/Integr8sCode

Length of output: 3577


🏁 Script executed:

# Check if admin_events_service actually calls get_event_stats (production usage)
rg "get_event_stats" ./backend/app --type py -B 2 -A 2

Repository: HardMax71/Integr8sCode

Length of output: 2104


🏁 Script executed:

# Confirm the pattern across all affected locations one more time
rg "\.group\(by=[A-Z]" ./backend/app/db/repositories --type py -A 1

Repository: HardMax71/Integr8sCode

Length of output: 1915


Wrap Beanie field descriptors with S.field() in Pipeline.group(by=...) calls.

monggregate's by= parameter requires explicit field path conversion. Passing raw Beanie FieldDescriptor objects (e.g., SagaDocument.state) will fail to group correctly. Use S.field() wrapper instead, consistent with the accumulators in the query= parameter. This affects production code: admin_events_repository.get_event_stats() is exposed via GET /api/admin/events/stats.

Affected locations
  • saga_repository.py:98 - by=SagaDocument.state
  • saga_repository.py:127 - by=SagaDocument.state
  • admin_events_repository.py:138 - by=EventDocument.event_type
  • admin_events_repository.py:163 - by=EventDocument.metadata.user_id
  • dlq_repository.py - by=DLQMessageDocument.status and by=DLQMessageDocument.event.event_type

Fix: Wrap with S.field(), e.g., by=S.field(EventDocument.event_type) (already used correctly in query accumulators on the same lines).
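
A minimal hedged sketch of the suggested change, assuming monggregate's S.field() is given an explicit string field path (the safest reading given the later comment about string paths); Pipeline, S, and export() follow the usage already visible in this repository:

from monggregate import Pipeline, S

# Before (group key passed as a raw Beanie field descriptor):
#   Pipeline().group(by=SagaDocument.state, query={"count": S.sum(1)})

# After: make the group key an explicit field reference, mirroring the
# accumulators already built with S.* in the query= argument.
state_pipeline = Pipeline().group(
    by=S.field("state"),
    query={"count": S.sum(1)},
)

# The exported stages are then passed to Beanie as before:
#   SagaDocument.aggregate(state_pipeline.export())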

🤖 Prompt for AI Agents
In @backend/app/db/repositories/saga_repository.py around lines 97 - 102, The
aggregation group calls are using raw Beanie FieldDescriptor objects (e.g.,
SagaDocument.state, EventDocument.event_type, EventDocument.metadata.user_id,
DLQMessageDocument.status, DLQMessageDocument.event.event_type) which prevents
monggregate from resolving field paths; update each Pipeline.group(by=...) to
wrap the field descriptors with S.field(...) (for example change
by=SagaDocument.state to by=S.field(SagaDocument.state)) in the methods
count_sagas_by_state and the other group usages noted so the group key is a
proper field path.

Comment on lines +138 to 147
 duration_pipeline = (
     Pipeline()
     .project(
         duration={"$subtract": [S.field(SagaDocument.completed_at), S.field(SagaDocument.created_at)]}
     )
     .group(by=None, query={"avg_duration": S.avg("$duration")})
 )
 avg_duration = 0.0
-async for doc in SagaDocument.find(*completed_conditions).aggregate(duration_pipeline):
+async for doc in SagaDocument.find(*completed_conditions).aggregate(duration_pipeline.export()):
     avg_duration = doc["avg_duration"] / 1000.0 if doc["avg_duration"] else 0.0

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check monggregate S.field() usage patterns in the codebase
echo "=== Searching for S.field() usage ==="
rg -n "S\.field\(" --type py -C2

echo -e "\n=== Checking imports of monggregate ==="
rg -n "from monggregate|import monggregate" --type py

echo -e "\n=== Checking requirements or setup files ==="
fd -e "requirements*.txt" -e "setup.py" -e "pyproject.toml" | head -10 | xargs cat 2>/dev/null | grep -i monggregate || echo "No monggregate found"

Repository: HardMax71/Integr8sCode

Length of output: 7501


🏁 Script executed:

#!/bin/bash
# Look at the sagadocument and S imports in the repository
echo "=== Checking SagaDocument definition ==="
rg -n "class SagaDocument" --type py -A5

echo -e "\n=== Checking S import ==="
rg -n "from.*import.*S\b|import.*\bS\b" --type py | head -20

echo -e "\n=== Checking the file in question for context ==="
head -50 backend/app/db/repositories/saga_repository.py 2>/dev/null || echo "File not found, trying to locate it"
fd saga_repository.py

Repository: HardMax71/Integr8sCode

Length of output: 4215


🏁 Script executed:

#!/bin/bash
# Check how Beanie fields are used - if they serialize to strings
echo "=== Checking SagaDocument field definitions ==="
cat backend/app/db/docs/saga.py | head -100

echo -e "\n=== Searching for S.field() calls with raw MongoDB operators (strongest evidence) ==="
rg -n 'S\.field\(' backend/app/db/repositories/ -C1 | grep -E '(\$subtract|\$group|\$min|\$max|\$sum|_id)' -A1 -B1

Repository: HardMax71/Integr8sCode

Length of output: 4103


🏁 Script executed:

#!/bin/bash
# Check if there's any issue or failure logs related to S.field usage
echo "=== Looking for tests that verify S.field() works correctly ==="
fd -e "test_*.py" -e "*_test.py" | xargs grep -l "S\.field" 2>/dev/null | head -5

echo -e "\n=== Check git history or issues mentioning S.field ==="
git log --all --oneline --grep="S\.field" 2>/dev/null | head -5 || echo "No git history available"

echo -e "\n=== Verify that these operations work in practice by checking if there are actual usages ==="
rg -n "get_saga_statistics|get_dlq_stats" backend/app --type py -A2 -B2 | head -40

Repository: HardMax71/Integr8sCode

Length of output: 2596


🌐 Web query:

monggregate S.field() Beanie document attributes

💡 Result:

Use S.field("dot.path") to reference document attributes (as stored in MongoDB) inside monggregate expressions. Example pattern from a real project:

  • S.field("ingredients.quantity") — accesses nested ingredients.quantity.
  • Use it inside operators: S.multiply(S.field("ingredients.quantity"), S.divide(...)).
  • Use the document's stored field names (dot notation for nested fields); you don't need Beanie model attributes/objects — pass the DB field path as a string.

Sources: monggregate docs/examples and a code example using S.field("ingredients.quantity") in a project writeup. [1][2]

References

  1. Monggregate documentation/examples. [1]
  2. Example usage (Baker project using monggregate showing S.field("ingredients.quantity")). [2]

Use string field names with S.field() instead of Beanie document attributes.

The S.field() function from monggregate expects string field names (e.g., S.field("completed_at")), not Beanie document attributes. Line 141 should use:

Suggested fix
duration_pipeline = (
    Pipeline()
    .project(
        duration={"$subtract": [S.field("completed_at"), S.field("created_at")]}
    )
    .group(by=None, query={"avg_duration": S.avg("$duration")})
)

This same pattern issue exists across dlq_repository.py, event_repository.py, and admin_events_repository.py—all should pass string field names to S.field().

The milliseconds-to-seconds conversion on line 147 is correct.

🤖 Prompt for AI Agents
In @backend/app/db/repositories/saga_repository.py around lines 138 - 147,
S.field() is currently being called with Beanie document attributes; change
calls like S.field(SagaDocument.completed_at) and
S.field(SagaDocument.created_at) to use string field names ("completed_at",
"created_at") inside the duration_pipeline projection, and apply the same fix
wherever S.field(...) is used in dlq_repository.py, event_repository.py, and
admin_events_repository.py so all aggregates pass string field names; keep the
rest of the aggregation (group and milliseconds-to-seconds conversion) unchanged
and ensure references to SagaDocument.find(...) aggregate remain the same.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/db/repositories/dlq_repository.py (1)

93-98: Bug: event_type filter references non-existent field.

DLQMessageDocument.event_type doesn't exist — per DLQMessageDocument (see backend/app/db/docs/dlq.py), event_type is nested inside event. This will cause the filter to fail or return no matches.

Proposed fix
         conditions: list[Any] = [
             DLQMessageDocument.status == status if status else None,
             DLQMessageDocument.original_topic == topic if topic else None,
-            DLQMessageDocument.event_type == event_type if event_type else None,
+            DLQMessageDocument.event.event_type == event_type if event_type else None,
         ]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87516a8 and 518da12.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • backend/app/db/repositories/dlq_repository.py
  • backend/pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/db/repositories/dlq_repository.py (3)
backend/app/domain/events/event_models.py (1)
  • SortDirection (20-22)
backend/app/db/docs/dlq.py (1)
  • DLQMessageDocument (11-43)
backend/app/dlq/models.py (7)
  • TopicStatistic (119-126)
  • EventTypeStatistic (129-135)
  • AgeStatistics (138-145)
  • DLQStatistics (148-157)
  • DLQMessageStatus (12-18)
  • DLQMessage (31-52)
  • DLQTopicSummary (189-200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Scan Backend
  • GitHub Check: Scan Frontend
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (5)
backend/pyproject.toml (1)

126-126: monggregate 0.22.1 is compatible with existing dependencies.

Version 0.22.1 is the latest stable release and satisfies all dependency requirements: pydantic (project has 2.9.2, requires ≥2.6.0), pymongo (project has 4.12.1, requires ≥3.0.0), and typing-extensions (project has 4.12.2, requires ≥4.0). The addition aligns with the project's pinned dependency versioning approach.

backend/app/db/repositories/dlq_repository.py (4)

111-113: LGTM!

Correctly uses the nested event.event_id path and Pydantic v2's model_validate for conversion.


115-155: LGTM!

The two-stage aggregation approach is well-structured. Using $arrayToObject for status breakdown and computing averages via $divide/$round is appropriate.


157-166: LGTM!

Correctly queries by nested event.event_id and updates the appropriate fields before saving.


168-178: LGTM!

Correctly queries by nested event.event_id and updates the appropriate fields before saving.

Comment on lines +57 to +79
+        # Age statistics - get timestamps and sum for true average, calculate ages in Python
+        time_pipeline = Pipeline().group(
+            by=None,
+            query={
+                "oldest": S.min(S.field(DLQMessageDocument.failed_at)),
+                "newest": S.max(S.field(DLQMessageDocument.failed_at)),
+                "count": S.sum(1),
+                "sum_failed_at": S.sum(S.field(DLQMessageDocument.failed_at)),
+            },
+        )
-        ]
-        age_result = []
-        async for doc in DLQMessageDocument.aggregate(age_pipeline):
-            age_result.append(doc)
-        age_stats_data = age_result[0] if age_result else {}
-        age_stats = AgeStatistics(
-            min_age_seconds=age_stats_data.get("min_age", 0.0),
-            max_age_seconds=age_stats_data.get("max_age", 0.0),
-            avg_age_seconds=age_stats_data.get("avg_age", 0.0),
-        )
+        time_results = await DLQMessageDocument.aggregate(time_pipeline.export()).to_list()
+        now = datetime.now(timezone.utc)
+        if time_results and time_results[0].get("oldest"):
+            r = time_results[0]
+            oldest, newest, count = r["oldest"], r["newest"], r["count"]
+            # sum_failed_at is sum of timestamps in ms since epoch, compute true average age
+            now_ms = now.timestamp() * 1000
+            avg_age_seconds = (now_ms * count - r["sum_failed_at"]) / count / 1000 if count > 0 else 0.0
+            age_stats = AgeStatistics(
+                min_age_seconds=(now - newest).total_seconds(),
+                max_age_seconds=(now - oldest).total_seconds(),
+                avg_age_seconds=avg_age_seconds,
+            )

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

MongoDB $sum operator Date field aggregation behavior milliseconds

💡 Result:

Short answer: $sum treats BSON Date as non‑numeric and will ignore it (so it won't add ISODate values directly). To sum dates as milliseconds convert them first (e.g. {$toLong: "$date"}) and then $sum the converted longs. See MongoDB docs for $sum behavior, $toLong conversion (Date → milliseconds), and date subtraction returning milliseconds. [1][2][3]

Example (aggregation):
{ $group: { _id: null, totalMs: { $sum: { $toLong: "$date" } } } }

Sources:

  • $sum behavior (ignores non‑numeric values). [1]
  • $toLong converts Date to milliseconds since epoch. [2]
  • $subtract / date ops use milliseconds for date arithmetic. [3]

🏁 Script executed:

# Search for monggregate usage examples with datetime/date fields
rg -i "addfields|subtract" --type py -A 3 -B 1 | head -50

Repository: HardMax71/Integr8sCode

Length of output: 529


🏁 Script executed:

# Check if there are any examples in the codebase using datetime in pipelines
rg "Pipeline\(\)" --type py -A 10 | grep -i "datetime\|now\|date" | head -30

Repository: HardMax71/Integr8sCode

Length of output: 691


🏁 Script executed:

# Look at how DLQMessageDocument is used with dates
rg "DLQMessageDocument" --type py -B 2 -A 2 | grep -i "failed_at\|date\|time" | head -30

Repository: HardMax71/Integr8sCode

Length of output: 3070


🏁 Script executed:

# Look for $$NOW or similar time-related pipeline operations
rg '\$\$NOW|current_date' --type py

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Check how other repositories handle age/time calculations in pipelines
rg 'timestamp\(\)' --type py -B 3 -A 3 | head -40

Repository: HardMax71/Integr8sCode

Length of output: 3297


🏁 Script executed:

# Look at the actual dlq_repository.py file to see the full context and imports
head -30 backend/app/db/repositories/dlq_repository.py

Repository: HardMax71/Integr8sCode

Length of output: 967


Incorrect fix: MongoDB pipeline cannot evaluate Python datetime objects.

The original code has a real bug—$sum on a Date field doesn't yield epoch milliseconds (it returns null/0)—but the proposed fix compounds the problem by attempting to use a Python datetime object directly in the MongoDB pipeline expression. MongoDB pipelines operate on BSON data and cannot evaluate Python objects.

To compute average age correctly, either:

  1. Calculate in Python (simplest): Remove the buggy sum_failed_at aggregation field. Use the oldest and newest timestamps already retrieved to compute all three statistics (min/max/avg ages) in Python after the aggregation.
  2. Pass milliseconds as literal: If computing in the pipeline is preferred, convert now to milliseconds before sending to MongoDB: now_ms = int(now.timestamp() * 1000), then use it as a literal value in the pipeline (not a Python object).

The min/max age calculations—(now - newest).total_seconds() and (now - oldest).total_seconds()—are already correct as-is.
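
A hedged middle ground between the two options above: keep the average computation in Python but have the pipeline sum epoch milliseconds via $toLong (per the web-query note that $sum ignores Dates). DLQMessageDocument and failed_at come from the snippet under review, the import path is the one cited elsewhere in this review, and failed_at is assumed to be a timezone-aware BSON Date:

from datetime import datetime, timezone

from app.db.docs.dlq import DLQMessageDocument  # path as cited in this review


async def compute_dlq_age_stats() -> dict[str, float]:
    """Sketch: min/max/true-average message age in seconds, without summing raw Dates."""
    pipeline = [
        {
            "$group": {
                "_id": None,
                "oldest": {"$min": "$failed_at"},
                "newest": {"$max": "$failed_at"},
                "count": {"$sum": 1},
                # $sum ignores BSON Dates, so convert to epoch milliseconds first
                "sum_failed_at_ms": {"$sum": {"$toLong": "$failed_at"}},
            }
        }
    ]
    rows = await DLQMessageDocument.aggregate(pipeline).to_list()
    if not rows or rows[0].get("oldest") is None:
        return {"min_age_seconds": 0.0, "max_age_seconds": 0.0, "avg_age_seconds": 0.0}
    r = rows[0]
    now = datetime.now(timezone.utc)
    now_ms = int(now.timestamp() * 1000)
    return {
        "min_age_seconds": (now - r["newest"]).total_seconds(),
        "max_age_seconds": (now - r["oldest"]).total_seconds(),
        "avg_age_seconds": (now_ms * r["count"] - r["sum_failed_at_ms"]) / r["count"] / 1000.0,
    }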

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 7 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/services/k8s_worker/config.py">

<violation number="1" location="backend/app/services/k8s_worker/config.py:11">
P1: Consumer group mismatch: `consumer_group` is `"kubernetes-worker"` but `GroupId.K8S_WORKER` resolves to `"k8s-worker"`. This inconsistency can cause issues with Kafka consumer group coordination. Consider using the enum value for consistency.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

     # Kafka settings
     consumer_group: str = "kubernetes-worker"
-    topics: list[KafkaTopic] = field(default_factory=lambda: [KafkaTopic.EXECUTION_TASKS])
+    topics: list[KafkaTopic] = field(default_factory=lambda: list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.K8S_WORKER]))

P1: Consumer group mismatch: consumer_group is "kubernetes-worker" but GroupId.K8S_WORKER resolves to "k8s-worker". This inconsistency can cause issues with Kafka consumer group coordination. Consider using the enum value for consistency.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/services/k8s_worker/config.py, line 11:

<comment>Consumer group mismatch: `consumer_group` is `"kubernetes-worker"` but `GroupId.K8S_WORKER` resolves to `"k8s-worker"`. This inconsistency can cause issues with Kafka consumer group coordination. Consider using the enum value for consistency.</comment>

<file context>
@@ -1,14 +1,14 @@
     # Kafka settings
     consumer_group: str = "kubernetes-worker"
-    topics: list[KafkaTopic] = field(default_factory=lambda: [KafkaTopic.EXECUTION_TASKS])
+    topics: list[KafkaTopic] = field(default_factory=lambda: list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.K8S_WORKER]))
 
     # Kubernetes settings
</file context>
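
For what the alignment could look like, a hedged sketch trimmed to the two fields under discussion (the real K8sWorkerConfig has more fields; note that changing the group name also makes Kafka treat it as a brand-new consumer group with no committed offsets):

from dataclasses import dataclass, field

from app.domain.enums.kafka import CONSUMER_GROUP_SUBSCRIPTIONS, GroupId, KafkaTopic


@dataclass
class K8sWorkerConfig:
    # Derive the group id from the same enum used as the subscription key
    consumer_group: str = GroupId.K8S_WORKER.value
    topics: list[KafkaTopic] = field(
        default_factory=lambda: list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.K8S_WORKER])
    )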

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/domain/enums/kafka.py (1)

117-119: Remove stale K8S_WORKER entry from unused CONSUMER_GROUP_EVENTS constant.

CONSUMER_GROUP_EVENTS is dead code—it's defined in kafka.py but never imported or used anywhere in the codebase. The GroupId.K8S_WORKER entry listing EventType.EXECUTION_STARTED is inaccurate: K8S_WORKER actually subscribes to SAGA_COMMANDS and handles CREATE_POD_COMMAND and DELETE_POD_COMMAND events. Either remove the unused CONSUMER_GROUP_EVENTS constant entirely or delete the stale K8S_WORKER entry.

🤖 Fix all issues with AI agents
In @backend/app/services/coordinator/coordinator.py:
- Line 12: Import KafkaTopic is unused in coordinator, causing a linter F401;
remove KafkaTopic from the import statement so only CONSUMER_GROUP_SUBSCRIPTIONS
and GroupId are imported (i.e., change "from app.domain.enums.kafka import
CONSUMER_GROUP_SUBSCRIPTIONS, GroupId, KafkaTopic" to omit KafkaTopic), then run
the linter/tests to confirm the warning is resolved.

In @backend/app/services/k8s_worker/worker.py:
- Around line 13-34: The import block is unsorted; group and alphabetize imports
with standard library first, third-party next, then local app imports, and
ensure all Kubernetes imports are adjacent and ordered consistently—combine
"from kubernetes import client as k8s_client" and "from kubernetes import config
as k8s_config" with "from kubernetes.client.rest import ApiException" into the
third-party group and alphabetize them (keep references to k8s_client,
k8s_config, ApiException intact), then re-run the linter to confirm I001 is
resolved.
- Line 11: Remove the unused KafkaTopic import from the import statement in
worker.py: change the line importing from app.domain.enums.kafka so it only
imports CONSUMER_GROUP_SUBSCRIPTIONS and GroupId (remove KafkaTopic) to satisfy
the linter and avoid the unused-import warning.
🧹 Nitpick comments (1)
backend/app/services/result_processor/processor.py (1)

45-48: Clarify KafkaTopic usage for linter compliance.

KafkaTopic is used in the type annotations on Lines 45 and 48. The linter flags it as unused (F401), which may be a false positive due to its use in type hints. Consider using TYPE_CHECKING import guard if the linter continues to flag this:

♻️ Alternative approach if linter persists
from __future__ import annotations
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from app.domain.enums.kafka import KafkaTopic

However, since KafkaTopic appears to be needed at runtime for Field(default=KafkaTopic.EXECUTION_RESULTS) on Line 48, the import is actually required. The linter may need a # noqa: F401 comment or configuration update.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceaad4b and 84dd3dc.

📒 Files selected for processing (7)
  • backend/app/domain/enums/kafka.py
  • backend/app/services/coordinator/coordinator.py
  • backend/app/services/k8s_worker/config.py
  • backend/app/services/k8s_worker/worker.py
  • backend/app/services/result_processor/processor.py
  • backend/app/services/sse/kafka_redis_bridge.py
  • backend/tests/integration/result_processor/test_result_processor.py
🧰 Additional context used
🧬 Code graph analysis (3)
backend/app/services/k8s_worker/config.py (1)
backend/app/domain/enums/kafka.py (2)
  • GroupId (57-69)
  • KafkaTopic (7-54)
backend/tests/integration/result_processor/test_result_processor.py (1)
backend/app/domain/events/typed.py (3)
  • EventMetadata (16-27)
  • ExecutionCompletedEvent (93-99)
  • ResultStoredEvent (193-198)
backend/app/services/coordinator/coordinator.py (2)
backend/app/domain/enums/kafka.py (2)
  • GroupId (57-69)
  • KafkaTopic (7-54)
backend/app/events/core/dispatcher.py (1)
  • EventDispatcher (15-177)
🪛 GitHub Actions: Ruff Linting
backend/app/services/k8s_worker/worker.py

[error] 11-11: F401: 'KafkaTopic' imported but unused. Remove unused import: app.domain.enums.kafka.KafkaTopic.


[warning] I001: Import block is un-sorted or un-formatted.

backend/app/services/result_processor/processor.py

[error] 11-11: F401: 'KafkaTopic' imported but unused. Remove unused import: app.domain.enums.kafka.KafkaTopic.


[warning] I001: Import block is un-sorted or un-formatted.

backend/app/services/coordinator/coordinator.py

[error] 12-12: F401: 'KafkaTopic' imported but unused. Remove unused import: app.domain.enums.kafka.KafkaTopic.


[warning] I001: Import block is un-sorted or un-formatted.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Scan Backend
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (19)
backend/tests/integration/result_processor/test_result_processor.py (4)

11-11: LGTM!

Import correctly updated to use the unified domain event types as per the migration.


72-75: LGTM!

Good filtering by execution_id to prevent test interference from concurrent test runs or leftover events from previous tests.


109-122: LGTM!

Clean test flow: wait for the ResultStoredEvent signal before verifying DB persistence, eliminating the need for polling. The 12-second timeout is reasonable for an integration test, and the finally block ensures proper cleanup.


91-108: The review comment's concern is not valid. The ResultProcessor's consumer is explicitly configured with auto_offset_reset="earliest" at line 122 of backend/app/services/result_processor/processor.py, not the Kafka default of "latest". This means the processor will read the pre-produced event, and the test setup is correct. The comment in the test at line 92 is accurate.

Likely an incorrect or invalid review comment.

backend/app/services/sse/kafka_redis_bridge.py (3)

8-9: LGTM!

Import changes correctly migrate to the unified domain event model, importing DomainEvent and the centralized subscription mappings.


90-91: LGTM! Good use of centralized subscription configuration.

Using CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.WEBSOCKET_GATEWAY] aligns the SSE bridge with the WebSocket gateway's topic subscriptions, which is appropriate since both serve real-time client delivery. The comment clearly explains the rationale.


118-131: LGTM!

The handler signature correctly updated to accept DomainEvent, consistent with the unified event model. The event data extraction via model_dump() and execution_id lookup remains unchanged in behavior.

backend/app/domain/enums/kafka.py (2)

52-52: LGTM!

New DLQ_EVENTS topic correctly added to support DLQ telemetry event publishing as described in PR objectives.


78-80: LGTM!

The subscription change to SAGA_COMMANDS correctly reflects that K8S_WORKER now receives pod commands from the saga orchestrator rather than raw execution events. The inline comment documents the rationale well.

backend/app/services/k8s_worker/config.py (2)

11-11: LGTM!

Good refactor to derive topics dynamically from the centralized CONSUMER_GROUP_SUBSCRIPTIONS mapping. This ensures K8sWorkerConfig stays in sync with the source-of-truth topic configuration.


4-4: Keep the KafkaTopic import; it is used for the type annotation.

KafkaTopic is directly referenced on line 11 as the type annotation for the topics field: topics: list[KafkaTopic]. The import is necessary and removing it would break the type hint.

backend/app/services/k8s_worker/worker.py (2)

137-138: LGTM!

Good refactor to use centralized subscriptions. The comment clearly documents that topics come from the centralized config rather than being hardcoded.


214-224: LGTM with a note on the assert pattern.

The wrapper pattern using assert isinstance(event, ...) is consistent with the unified event model approach. The assertion provides runtime type safety while allowing the dispatcher to use a common DomainEvent signature.

Note: In production, if an unexpected event type is dispatched to these handlers, an AssertionError will be raised. This is acceptable as it indicates a configuration/routing bug that should surface during development/testing.

backend/app/services/result_processor/processor.py (4)

148-160: LGTM!

Handler wrappers correctly updated to accept DomainEvent with type assertions, consistent with the unified event model pattern used across this PR.


173-187: Good defensive improvement for optional resource_usage.

The guard if event.resource_usage: prevents AttributeError when resource_usage is None. This is a sensible change since completed events may not always have resource usage data available.


214-214: LGTM!

Graceful handling of error_type with string conversion and fallback to "unknown" is defensive and prevents potential issues if error_type is None.


11-11: No action needed—KafkaTopic is actively used.

The import on line 11 is used on lines 45 and 48 as type annotations for topics: list[KafkaTopic] and result_topic: KafkaTopic. The F401 linter warning is either absent or a false positive; the import is necessary and correctly placed.

Likely an incorrect or invalid review comment.

backend/app/services/coordinator/coordinator.py (2)

14-27: Import migration looks correct; address linter warning for sorting.

The consolidation of event imports from app.domain.events.typed aligns with the PR's goal of unifying event systems. However, the Ruff linter (I001) flags the import block as un-sorted. Consider running the formatter or sorting imports to satisfy CI.


160-160: Good refactor: centralized topic subscriptions.

Using CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR] centralizes topic configuration, improving maintainability and consistency across services.

Comment on lines +245 to +248
# Schedule immediately if at front of queue (position 0)
if position == 0:
    await self._schedule_execution(event)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate the QueueManager class
fd -t f "*.py" | xargs rg -l "class QueueManager"

Repository: HardMax71/Integr8sCode

Length of output: 416


🏁 Script executed:

# Find the get_next_execution method implementation
rg -n -A 20 "def get_next_execution" --type py

Repository: HardMax71/Integr8sCode

Length of output: 2094


🏁 Script executed:

# Check the add_execution method
rg -n -A 20 "def add_execution" --type py

Repository: HardMax71/Integr8sCode

Length of output: 2108


🏁 Script executed:

# Examine the _schedule_execution method in coordinator.py
sed -n '340,380p' backend/app/services/coordinator/coordinator.py

Repository: HardMax71/Integr8sCode

Length of output: 2213


🏁 Script executed:

# Check the entire _handle_execution_requested method to understand context
rg -n -B 5 -A 30 "async def _handle_execution_requested" backend/app/services/coordinator/coordinator.py

Repository: HardMax71/Integr8sCode

Length of output: 1804


🏁 Script executed:

# Look for _active_executions initialization and usage
rg -n "_active_executions" backend/app/services/coordinator/coordinator.py

Repository: HardMax71/Integr8sCode

Length of output: 1166


🏁 Script executed:

rg -n -A 60 "async def _schedule_execution" backend/app/services/coordinator/coordinator.py | head -80

Repository: HardMax71/Integr8sCode

Length of output: 3347


🏁 Script executed:

# Also check the _scheduling_loop implementation to confirm it doesn't have additional guards
rg -n -B 5 -A 20 "async def _scheduling_loop" backend/app/services/coordinator/coordinator.py

Repository: HardMax71/Integr8sCode

Length of output: 1238


🏁 Script executed:

# Check if there are any locks or deduplication mechanisms between queue removal and scheduling
rg -n "_queue_lock\|_execution_resources\|_active_executions" backend/app/services/coordinator/queue_manager.py

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

rg -n "_scheduling_semaphore" backend/app/services/coordinator/coordinator.py

Repository: HardMax71/Integr8sCode

Length of output: 201


🏁 Script executed:

# Check the semaphore initialization in __init__
rg -n -A 50 "def __init__" backend/app/services/coordinator/coordinator.py | grep -A 50 "ExecutionCoordinator"

Repository: HardMax71/Integr8sCode

Length of output: 48


🏁 Script executed:

# Get more context around semaphore initialization
sed -n '80,120p' backend/app/services/coordinator/coordinator.py

Repository: HardMax71/Integr8sCode

Length of output: 1940


🏁 Script executed:

# Verify how max_concurrent_scheduling is configured and what value it has
rg -n "max_concurrent_scheduling" backend/app/services/coordinator/coordinator.py

Repository: HardMax71/Integr8sCode

Length of output: 268


🏁 Script executed:

# Check the initialization to understand the full picture
sed -n '60,100p' backend/app/services/coordinator/coordinator.py

Repository: HardMax71/Integr8sCode

Length of output: 1678


Race condition between immediate scheduling and scheduling loop.

When position == 0, this code immediately awaits _schedule_execution(event). Concurrently, _scheduling_loop (line 301) calls get_next_execution() which removes the same event from the queue and creates a task to schedule it.

Both execution paths can enter _schedule_execution() for the same execution_id before either reaches the deduplication point at line 342 (_active_executions.add()). The guard at line 321 is checked before that addition, creating a time-of-check-time-of-use (TOCTOU) window where both paths request resource allocation concurrently.

The _scheduling_semaphore only limits concurrency count, not prevents duplicates.

Consider:

  1. Removing the event from the queue before scheduling, or
  2. Adding a check in _schedule_execution() that persists the execution_id before resource allocation (see the sketch after this list), or
  3. Removing this eager scheduling and relying solely on the loop.
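
A hedged sketch of option 2, shown as a trimmed stand-in class rather than the real coordinator; the attribute names mirror the review, while the asyncio.Lock is an assumption, not existing code:

import asyncio
from typing import Any


class SchedulerSketch:
    def __init__(self) -> None:
        self._active_executions: set[str] = set()
        self._active_lock = asyncio.Lock()

    async def _claim(self, execution_id: str) -> bool:
        # Atomic check-and-add closes the TOCTOU window between the eager
        # position == 0 path and the scheduling loop.
        async with self._active_lock:
            if execution_id in self._active_executions:
                return False
            self._active_executions.add(execution_id)
            return True

    async def _schedule_execution(self, event: Any) -> None:
        if not await self._claim(event.execution_id):
            return  # the other path already claimed this execution
        # ... resource allocation and pod scheduling would follow here ...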

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/k8s_worker/worker.py (1)

215-225: Replace assert with explicit type checks for production safety.

Using assert isinstance(...) for runtime type validation is fragile—assertions are stripped when Python runs with the -O flag (common in production containers). If an unexpected event type reaches these handlers, it would proceed without validation and likely fail with a confusing AttributeError later.

🔒️ Proposed fix using explicit type checks
     async def _handle_create_pod_command_wrapper(self, event: DomainEvent) -> None:
         """Wrapper for handling CreatePodCommandEvent with type safety."""
-        assert isinstance(event, CreatePodCommandEvent)
+        if not isinstance(event, CreatePodCommandEvent):
+            raise TypeError(f"Expected CreatePodCommandEvent, got {type(event).__name__}")
         self.logger.info(f"Processing create_pod_command for execution {event.execution_id} from saga {event.saga_id}")
         await self._handle_create_pod_command(event)

     async def _handle_delete_pod_command_wrapper(self, event: DomainEvent) -> None:
         """Wrapper for handling DeletePodCommandEvent."""
-        assert isinstance(event, DeletePodCommandEvent)
+        if not isinstance(event, DeletePodCommandEvent):
+            raise TypeError(f"Expected DeletePodCommandEvent, got {type(event).__name__}")
         self.logger.info(f"Processing delete_pod_command for execution {event.execution_id} from saga {event.saga_id}")
         await self._handle_delete_pod_command(event)
🧹 Nitpick comments (1)
backend/app/services/result_processor/processor.py (1)

151-161: Consider replacing assert with explicit type checks for production safety.

Using assert isinstance(...) for type validation can be problematic because assertions are disabled when Python runs with the -O (optimize) flag. If that happens, a misrouted event would cause cryptic AttributeError failures instead of a clear type error.

♻️ Suggested refactor using explicit type guard
 async def _handle_completed_wrapper(self, event: DomainEvent) -> None:
-    assert isinstance(event, ExecutionCompletedEvent)
+    if not isinstance(event, ExecutionCompletedEvent):
+        raise TypeError(f"Expected ExecutionCompletedEvent, got {type(event).__name__}")
     await self._handle_completed(event)

 async def _handle_failed_wrapper(self, event: DomainEvent) -> None:
-    assert isinstance(event, ExecutionFailedEvent)
+    if not isinstance(event, ExecutionFailedEvent):
+        raise TypeError(f"Expected ExecutionFailedEvent, got {type(event).__name__}")
     await self._handle_failed(event)

 async def _handle_timeout_wrapper(self, event: DomainEvent) -> None:
-    assert isinstance(event, ExecutionTimeoutEvent)
+    if not isinstance(event, ExecutionTimeoutEvent):
+        raise TypeError(f"Expected ExecutionTimeoutEvent, got {type(event).__name__}")
     await self._handle_timeout(event)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84dd3dc and 3c86a02.

📒 Files selected for processing (4)
  • backend/app/services/coordinator/coordinator.py
  • backend/app/services/k8s_worker/worker.py
  • backend/app/services/result_processor/processor.py
  • backend/tests/unit/services/result_processor/test_processor.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/services/coordinator/coordinator.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (8)
backend/app/services/k8s_worker/worker.py (3)

15-24: LGTM!

The import changes correctly align with the PR objective of consolidating to a single Avro-backed domain event model. The centralized subscription configuration (CONSUMER_GROUP_SUBSCRIPTIONS, GroupId) and domain-typed events are now properly sourced from app.domain.events.typed.


50-59: Formatting adjustment only.

No functional changes; parameter indentation was reformatted.


138-139: Good refactor to centralized subscription configuration.

This change correctly moves topic subscription from hardcoded values to a centralized configuration, improving maintainability and consistency across consumers.

Verify that CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.K8S_WORKER] includes the expected topics for the K8S worker (e.g., the saga commands topic containing CREATE_POD_COMMAND and DELETE_POD_COMMAND events):

#!/bin/bash
# Find the CONSUMER_GROUP_SUBSCRIPTIONS definition to verify K8S_WORKER topics
rg -n -A 20 'CONSUMER_GROUP_SUBSCRIPTIONS\s*[=:]' --type py
backend/tests/unit/services/result_processor/test_processor.py (1)

18-20: LGTM! Test assertions properly validate centralized topic mapping.

The updated assertions correctly verify that config.topics matches the centralized CONSUMER_GROUP_SUBSCRIPTIONS mapping while still ensuring EXECUTION_EVENTS is present as a sanity check. This aligns well with the PR's goal of deriving topics from a single source of truth.

backend/app/services/result_processor/processor.py (4)

46-48: LGTM! Topics derived from centralized mapping.

Using default_factory with list conversion from the centralized CONSUMER_GROUP_SUBSCRIPTIONS is correct and aligns with the single-source-of-truth approach.


174-188: LGTM! Good defensive handling of optional resource_usage.

Guarding the metrics recording with if event.resource_usage: prevents AttributeError when the field is None. The existing TODO on line 184 appropriately acknowledges the memory limit parsing brittleness.


215-215: LGTM! Defensive error_type handling.

The fallback to "unknown" ensures metrics always receive a valid label even when error_type is None.


267-296: LGTM! Event publishing aligned with unified domain model.

Both _publish_result_stored and _publish_result_failed correctly use the new EventMetadata class from the unified domain events module.

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/tests/conftest.py">

<violation number="1" location="backend/tests/conftest.py:88">
P2: The cleanup code is described as 'best-effort' that 'may fail silently', but it lacks a try-except block. Any failure during cleanup will raise an exception during fixture teardown, potentially causing all tests using this fixture to report errors.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @backend/tests/conftest.py:
- Around line 85-90: The teardown's "best-effort cleanup" should be wrapped in a
try/except so failures in drop_database do not propagate; in the async with
application.router.lifespan_context(application) block, after obtaining
container via application.state.dishka_container and resolving db with await
container.get(Database), call
db.client.drop_database(test_settings.DATABASE_NAME) inside a try/except that
catches Exception and either logs the error or suppresses it (matching the "may
fail silently" behavior) to avoid masking test results.
🧹 Nitpick comments (1)
backend/tests/conftest.py (1)

27-36: Consider logging swallowed exceptions for debuggability.

The bare except Exception: pass silently hides all failures, including misconfigurations or unexpected errors. While acceptable for unit tests where Kafka is unavailable, this could mask issues in integration test runs.

♻️ Suggested improvement
 @pytest.hookimpl(trylast=True)
 def pytest_configure() -> None:
     """Create Kafka topics once in master process before xdist workers spawn."""
     # PYTEST_XDIST_WORKER is only set in workers, not master
     if os.environ.get("PYTEST_XDIST_WORKER"):
         return
     try:
         asyncio.run(create_topics(Settings(_env_file=".env.test")))
-    except Exception:
-        pass  # Kafka unavailable (unit tests)
+    except Exception as e:
+        # Kafka unavailable (unit tests) - log for debugging integration test issues
+        import logging
+        logging.getLogger(__name__).debug("Kafka topic creation skipped: %s", e)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4775e4d and 6cee5f5.

📒 Files selected for processing (5)
  • backend/.env.test
  • backend/app/settings.py
  • backend/pyproject.toml
  • backend/tests/conftest.py
  • docker-compose.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/.env.test
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/conftest.py (1)
backend/app/settings.py (1)
  • Settings (9-167)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Scan Backend
  • GitHub Check: Scan Frontend
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (5)
docker-compose.yaml (1)

50-53: LGTM!

Adding ulimits.nofile for MongoDB is a sensible operational improvement. The values (65536 soft/hard) are appropriate for production-like workloads and consistent with the limits already configured for Zookeeper and Kafka in this file.

backend/pyproject.toml (2)

219-221: Good practice: Disabling OTEL SDK during tests.

Setting OTEL_SDK_DISABLED="true" is an effective way to prevent test teardown delays caused by OTLP exporter retry attempts when no collector is available. This aligns with the commit message about removing the 40s teardown timeout.

Note: If any tests specifically validate OpenTelemetry instrumentation behavior, they may need to explicitly override this setting.


126-126: monggregate version 0.22.1 is valid and current. The pinned version is available on PyPI and is the latest release, making it an appropriate dependency for MongoDB aggregation pipeline operations.

backend/app/settings.py (1)

162-167: LGTM!

The hardcoded .env default is appropriate. Tests correctly override via Settings(_env_file=".env.test") as seen in conftest.py, and production uses the default. This simplifies configuration by removing the unnecessary DOTENV_PATH indirection.

backend/tests/conftest.py (1)

40-65: LGTM!

The per-worker isolation strategy is well-designed. Using Settings(_env_file=".env.test") with model_copy(update={...}) for dynamic overrides cleanly separates static test config from runtime isolation concerns. The docstring clearly documents what's shared vs. isolated.


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 14 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/events/core/types.py">

<violation number="1" location="backend/app/events/core/types.py:81">
P1: New `request_timeout_ms` field is not used in `to_consumer_config()`. The field is added but the conversion method doesn't include `"request.timeout.ms": self.request_timeout_ms`, so this setting will have no effect on the actual Kafka consumer.</violation>
</file>



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/dlq/manager.py (1)

206-255: Potential partial failure when sending to both topics.

The method sends to both retry_topic (lines 222-227) and original_topic (lines 230-235) without transactional guarantees. If the first send succeeds but the second fails, the system will be in an inconsistent state—the retry topic has the message but the original topic doesn't, yet the exception will prevent the status update.

Consider wrapping these in a try/except to handle partial failures gracefully, or document this as acceptable behavior given the retry nature of this code.
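
A hedged sketch of one way to surface the partial-failure case; `producer.send(topic, event)` stands in for the real publish call and is an assumption:

```python
import logging

logger = logging.getLogger(__name__)

async def publish_retry(producer, retry_topic: str, original_topic: str, event) -> None:
    # First send: if this fails, nothing has been published yet.
    await producer.send(retry_topic, event)
    try:
        await producer.send(original_topic, event)
    except Exception:
        # Partial failure: the retry topic already has the message but the
        # original topic does not. Log it explicitly and re-raise so the DLQ
        # status is not updated as if both sends succeeded.
        logger.exception(
            "Partial retry publish: %s succeeded, %s failed", retry_topic, original_topic
        )
        raise
```
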

backend/app/events/core/types.py (1)

81-105: Add missing request_timeout_ms to to_consumer_config() for consistency.

The to_consumer_config() method does not include the request_timeout_ms field (line 81), even though it exists in the ConsumerConfig dataclass. While production code currently uses direct field access (bypassing this method), the mapping should be added for consistency and to prevent silent field drops if the method is ever used.

🐛 Proposed fix to add missing mapping
     def to_consumer_config(self) -> dict[str, object]:
         """Convert to Confluent Kafka consumer configuration."""
         return {
             "bootstrap.servers": self.bootstrap_servers,
             "group.id": self.group_id,
             "client.id": self.client_id,
             "auto.offset.reset": self.auto_offset_reset,
             "enable.auto.commit": self.enable_auto_commit,
             "session.timeout.ms": self.session_timeout_ms,
             "heartbeat.interval.ms": self.heartbeat_interval_ms,
             "max.poll.interval.ms": self.max_poll_interval_ms,
+            "request.timeout.ms": self.request_timeout_ms,
             "fetch.min.bytes": self.fetch_min_bytes,
             "fetch.wait.max.ms": self.fetch_max_wait_ms,
             "statistics.interval.ms": self.statistics_interval_ms,
         }
🤖 Fix all issues with AI agents
In @backend/.env.test:
- Around line 28-33: Update the KAFKA_REQUEST_TIMEOUT_MS in the test env so it
is greater than KAFKA_SESSION_TIMEOUT_MS: change KAFKA_REQUEST_TIMEOUT_MS
(currently 5000) to at least 8000 to maintain a safe margin relative to
KAFKA_SESSION_TIMEOUT_MS=6000 while keeping SSE_CONSUMER_POOL_SIZE and
KAFKA_HEARTBEAT_INTERVAL_MS as-is.

In @backend/app/services/event_bus.py:
- Around line 96-99: The Kafka timeout settings violate required relationships:
request_timeout_ms (KAFKA_REQUEST_TIMEOUT_MS) must be less than
session_timeout_ms (KAFKA_SESSION_TIMEOUT_MS); currently 40000 ≥ 30000. Fix by
adjusting the settings so KAFKA_REQUEST_TIMEOUT_MS < KAFKA_SESSION_TIMEOUT_MS
(e.g., lower KAFKA_REQUEST_TIMEOUT_MS to <30000) or by increasing
KAFKA_SESSION_TIMEOUT_MS above the request timeout, and ensure the values loaded
into session_timeout_ms, heartbeat_interval_ms, max_poll_interval_ms, and
request_timeout_ms in event_bus.py respect the documented constraints.
♻️ Duplicate comments (4)
backend/app/services/notification_service.py (1)

334-334: Blocking notification delivery may increase API latency.

This change from asyncio.create_task() to await blocks the caller until delivery completes, including external HTTP calls with 30s timeouts. If intentional for delivery guarantees, document the tradeoff; otherwise, restore the background task pattern.
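
For context, the two patterns being weighed look roughly like this (a sketch; `deliver` stands in for the real delivery coroutine):

```python
import asyncio
from collections.abc import Awaitable, Callable

_background: set[asyncio.Task[None]] = set()

def notify_in_background(deliver: Callable[[object], Awaitable[None]], notification: object) -> None:
    # Fire-and-forget: the caller returns immediately; errors must be handled
    # inside the task, and a reference is kept so the task is not GC'd mid-flight.
    task = asyncio.create_task(deliver(notification))
    _background.add(task)
    task.add_done_callback(_background.discard)

async def notify_and_wait(deliver: Callable[[object], Awaitable[None]], notification: object) -> None:
    # Awaited delivery: the caller blocks until delivery (including any external
    # HTTP call and its timeout) completes, in exchange for a delivery guarantee.
    await deliver(notification)
```
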

backend/app/dlq/manager.py (2)

72-76: Previous concerns about null safety and error handling remain valid.

The issues flagged in previous reviews still apply:

  1. v.decode() on line 75 will crash if header value is None
  2. json.loads() and DLQMessage(**data, ...) lack error handling for malformed messages
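
A defensive sketch covering both points (function names and the header/payload shapes are assumptions):

```python
import json
import logging

logger = logging.getLogger(__name__)

def decode_headers(raw_headers: list[tuple[str, bytes | None]]) -> dict[str, str]:
    # Kafka header values may be None; skip them instead of calling .decode() on None.
    return {k: v.decode("utf-8", errors="replace") for k, v in raw_headers if v is not None}

def parse_dlq_payload(raw_value: bytes) -> dict | None:
    # Malformed payloads are logged and skipped rather than crashing the consumer.
    try:
        data = json.loads(raw_value)
    except (json.JSONDecodeError, UnicodeDecodeError):
        logger.exception("Skipping malformed DLQ payload")
        return None
    return data if isinstance(data, dict) else None
```
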

107-137: Async iteration pattern is clean, but error backoff concern remains.

The refactoring to async for msg in self.consumer: is a cleaner approach than manual polling. However, the missing error backoff after exception on line 137 was already flagged in a previous review—without a sleep, persistent errors (e.g., database unavailable) will cause rapid spinning.
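
A sketch of the backoff being asked for; `consumer` and `handle` are placeholders for the async Kafka consumer and the per-message handler:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def consume_with_backoff(consumer, handle, backoff_seconds: float = 5.0) -> None:
    # Persistent failures (e.g. the database being down) pause the loop instead
    # of letting it spin and immediately fail again on the next message.
    while True:
        try:
            async for msg in consumer:
                await handle(msg)
        except asyncio.CancelledError:
            raise
        except Exception:
            logger.exception("DLQ consume loop failed; backing off")
            await asyncio.sleep(backoff_seconds)
```
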

backend/app/services/coordinator/coordinator.py (1)

246-249: Off-by-one error: queue positions are 1-indexed.

The condition position == 0 will never match a legitimate front-of-queue scenario. Per QueueManager._get_queue_position, positions are 1-indexed (first item = position 1). The value 0 only occurs when position was None (error case) and defaulted to 0 at line 236.

To schedule immediately when at the front of the queue, the condition should be position == 1.

Additionally, even with the fix, there's a potential TOCTOU race with _scheduling_loop that could schedule the same execution twice before _active_executions.add() is reached.

🐛 Proposed fix for off-by-one
-            # Schedule immediately if at front of queue (position 0)
-            if position == 0:
+            # Schedule immediately if at front of queue (position 1, since 1-indexed)
+            if position == 1:
                 await self._schedule_execution(event)
🧹 Nitpick comments (2)
backend/app/dlq/manager.py (1)

389-416: Consider batch concurrency for large retry operations.

The sequential for event_id in event_ids loop processes retries one at a time. For large batches, this could be slow. Consider using asyncio.gather with a semaphore to limit concurrency:

async def retry_messages_batch(self, event_ids: list[str], max_concurrent: int = 10) -> DLQBatchRetryResult:
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def retry_with_limit(event_id: str) -> DLQRetryResult:
        async with semaphore:
            try:
                success = await self.retry_message_manually(event_id)
                return DLQRetryResult(event_id=event_id, status="success" if success else "failed", 
                                      error=None if success else "Retry failed")
            except Exception as e:
                return DLQRetryResult(event_id=event_id, status="failed", error=str(e))
    
    results = await asyncio.gather(*[retry_with_limit(eid) for eid in event_ids])
    # ... aggregate results
backend/app/services/k8s_worker/worker.py (1)

219-229: Reasonable runtime type narrowing pattern.

The wrapper methods accept DomainEvent (the union type) and use assert isinstance(...) to narrow to the specific event type. This is acceptable since the dispatcher guarantees routing by event_type, making the assertion a defensive check rather than validation logic.

However, if assertions are disabled (python -O), this would silently pass incorrect types. Consider using an explicit check with an exception if stricter runtime safety is desired.

♻️ Optional: Explicit type check instead of assert
 async def _handle_create_pod_command_wrapper(self, event: DomainEvent) -> None:
     """Wrapper for handling CreatePodCommandEvent with type safety."""
-    assert isinstance(event, CreatePodCommandEvent)
+    if not isinstance(event, CreatePodCommandEvent):
+        raise TypeError(f"Expected CreatePodCommandEvent, got {type(event).__name__}")
     self.logger.info(f"Processing create_pod_command for execution {event.execution_id} from saga {event.saga_id}")
     await self._handle_create_pod_command(event)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cee5f5 and 0cce76f.

📒 Files selected for processing (14)
  • backend/.env.test
  • backend/app/dlq/manager.py
  • backend/app/events/consumer_group_monitor.py
  • backend/app/events/core/consumer.py
  • backend/app/events/core/types.py
  • backend/app/events/event_store_consumer.py
  • backend/app/services/coordinator/coordinator.py
  • backend/app/services/event_bus.py
  • backend/app/services/k8s_worker/worker.py
  • backend/app/services/notification_service.py
  • backend/app/services/result_processor/processor.py
  • backend/app/services/saga/saga_orchestrator.py
  • backend/app/services/sse/kafka_redis_bridge.py
  • backend/app/settings.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • backend/app/events/event_store_consumer.py
  • backend/app/settings.py
  • backend/app/services/result_processor/processor.py
🧰 Additional context used
🧬 Code graph analysis (5)
backend/app/services/coordinator/coordinator.py (6)
backend/app/domain/enums/kafka.py (1)
  • GroupId (57-69)
backend/app/domain/events/typed.py (2)
  • CreatePodCommandEvent (380-395)
  • ExecutionRequestedEvent (46-60)
backend/app/events/core/types.py (1)
  • ConsumerConfig (66-105)
backend/app/events/core/dispatcher.py (1)
  • EventDispatcher (15-177)
backend/app/events/core/consumer.py (2)
  • UnifiedConsumer (23-257)
  • start (45-71)
backend/app/events/schema/schema_registry.py (1)
  • SchemaRegistryManager (35-119)
backend/app/services/sse/kafka_redis_bridge.py (4)
backend/app/domain/enums/kafka.py (1)
  • GroupId (57-69)
backend/app/settings.py (1)
  • Settings (9-170)
backend/app/core/metrics/events.py (1)
  • EventMetrics (4-209)
backend/app/services/sse/redis_bus.py (1)
  • SSERedisBus (44-89)
backend/app/services/saga/saga_orchestrator.py (3)
backend/app/domain/events/typed.py (2)
  • EventMetadata (16-27)
  • SagaCancelledEvent (349-358)
backend/app/events/event_store_consumer.py (1)
  • _handle_event (111-119)
backend/app/services/saga/base_saga.py (1)
  • BaseSaga (8-52)
backend/app/services/notification_service.py (1)
backend/app/domain/events/typed.py (3)
  • ExecutionCompletedEvent (93-99)
  • ExecutionFailedEvent (102-110)
  • ExecutionTimeoutEvent (113-119)
backend/app/services/k8s_worker/worker.py (6)
backend/app/domain/enums/kafka.py (1)
  • GroupId (57-69)
backend/app/domain/events/typed.py (5)
  • CreatePodCommandEvent (380-395)
  • DeletePodCommandEvent (398-404)
  • ExecutionFailedEvent (102-110)
  • ExecutionStartedEvent (78-83)
  • PodCreatedEvent (133-137)
backend/app/services/k8s_worker/config.py (1)
  • K8sWorkerConfig (8-35)
backend/app/events/core/producer.py (2)
  • producer (56-57)
  • UnifiedProducer (23-218)
backend/app/events/core/consumer.py (1)
  • start (45-71)
backend/app/services/idempotency/middleware.py (1)
  • start (263-271)
🪛 dotenv-linter (4.0.0)
backend/.env.test

[warning] 31-31: [UnorderedKey] The KAFKA_SESSION_TIMEOUT_MS key should go before the SSE_CONSUMER_POOL_SIZE key

(UnorderedKey)


[warning] 32-32: [UnorderedKey] The KAFKA_HEARTBEAT_INTERVAL_MS key should go before the KAFKA_SESSION_TIMEOUT_MS key

(UnorderedKey)


[warning] 33-33: [UnorderedKey] The KAFKA_REQUEST_TIMEOUT_MS key should go before the KAFKA_SESSION_TIMEOUT_MS key

(UnorderedKey)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Scan Backend
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (30)
backend/app/services/notification_service.py (4)

22-27: LGTM!

Import changes correctly migrate from the infrastructure Kafka events layer to the unified domain events as per the PR objectives.


648-660: LGTM!

The unified event handler correctly uses DomainEvent type hint and dispatches to specific typed handlers via isinstance checks. The defensive error handling and logging for unhandled event types are appropriate.


224-228: Good addition of Kafka consumer timeout settings.

These configuration parameters improve consumer resilience by allowing environment-specific tuning. The Settings class defines all required attributes with appropriate defaults: KAFKA_SESSION_TIMEOUT_MS (30000ms), KAFKA_HEARTBEAT_INTERVAL_MS (3000ms), KAFKA_MAX_POLL_INTERVAL_MS (300000ms), and KAFKA_REQUEST_TIMEOUT_MS (40000ms).
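
A trimmed illustration of how such fields typically look in a pydantic-settings class; defaults mirror the values listed above, and this is not the project's real Settings class:

```python
from pydantic_settings import BaseSettings

class KafkaTimeoutSettings(BaseSettings):
    # Defaults taken from the comment above.
    KAFKA_SESSION_TIMEOUT_MS: int = 30000
    KAFKA_HEARTBEAT_INTERVAL_MS: int = 3000
    KAFKA_MAX_POLL_INTERVAL_MS: int = 300000
    KAFKA_REQUEST_TIMEOUT_MS: int = 40000
```
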


635-636: LGTM!

Safe handling of optional resource_usage with fallback to 0.0. The execution_time_wall_seconds attribute exists in ResourceUsageDomain as a float with a default value of 0.0, and the duration extraction and formatting are correct.

backend/app/dlq/manager.py (9)

69-70: LGTM!

The initialization of _dlq_events_topic with proper prefix and _event_metadata for lifecycle events is clean and follows the domain event model pattern.


139-168: LGTM!

The message processing flow with filters, storage, retry policy evaluation, and status updates is well-structured. The transition to nested message.event.* field access is consistent throughout.


170-183: LGTM!

The upsert pattern using Beanie with nested field lookup is correct, and emitting the received event after storage ensures events are only published for successfully stored messages.


257-273: LGTM!

The discard flow properly updates metrics, persists status, emits the lifecycle event, and logs the action.


275-304: LGTM!

The monitor loop correctly uses is_running for graceful shutdown, has proper error backoff (60s sleep), and the Beanie query for scheduled messages is well-structured. Note that from_attributes=True is the correct approach for validating Beanie documents into Pydantic models.
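
A small illustration of the `from_attributes=True` pattern mentioned above, with trimmed stand-in models rather than the project's real DLQ types:

```python
from beanie import Document
from pydantic import BaseModel

class DLQMessageDocument(Document):  # stand-in for the real Beanie document
    event_id: str
    status: str

class DLQMessage(BaseModel):  # stand-in for the Pydantic domain model
    event_id: str
    status: str

async def load_dlq_message(doc_id) -> DLQMessage | None:
    doc = await DLQMessageDocument.get(doc_id)
    if doc is None:
        return None
    # from_attributes=True lets Pydantic read the Beanie document's attributes
    # directly instead of requiring a plain dict.
    return DLQMessage.model_validate(doc, from_attributes=True)
```
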


360-372: LGTM with note on error handling.

The _produce_dlq_event method correctly uses schema registry for Avro serialization. The error handling that logs and swallows exceptions is appropriate for telemetry/lifecycle events—failure to emit these events should not disrupt core DLQ processing.


374-387: LGTM!

Proper state validation guards against retrying messages in terminal states (DISCARDED, RETRIED), and the nested field query is consistent with the new model.


418-440: LGTM!

The manual discard implementation mirrors the retry pattern with proper state validation and clear docstring.


443-484: LGTM!

The factory function now properly configures Kafka client settings from the application settings, including session timeout, heartbeat interval, max poll interval, and request timeout. The producer configuration with enable_idempotence=True ensures exactly-once semantics for retried messages.
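
As a rough sketch of the idempotent-producer wiring described above, using aiokafka (the library the surrounding discussion references); field names beyond the documented KAFKA_* settings are assumptions:

```python
from aiokafka import AIOKafkaProducer

async def build_dlq_producer(settings) -> AIOKafkaProducer:
    producer = AIOKafkaProducer(
        bootstrap_servers=settings.KAFKA_BOOTSTRAP_SERVERS,
        client_id=f"{settings.SERVICE_NAME}-dlq-producer",  # naming is an assumption
        enable_idempotence=True,  # dedupes broker-side producer retries for retried messages
        request_timeout_ms=settings.KAFKA_REQUEST_TIMEOUT_MS,
    )
    await producer.start()
    return producer
```
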

backend/.env.test (1)

43-45: LGTM! Clear documentation of OpenTelemetry disabling mechanism.

The updated comments effectively explain the dual safety mechanism (empty endpoint + OTEL_SDK_DISABLED flag) for disabling OpenTelemetry in tests. This improves developer understanding of the test configuration.

backend/app/services/sse/kafka_redis_bridge.py (3)

74-77: LGTM! Kafka timeout settings are now configurable.

The consumer configuration now correctly uses settings-based timeout values (KAFKA_MAX_POLL_INTERVAL_MS, KAFKA_SESSION_TIMEOUT_MS, KAFKA_HEARTBEAT_INTERVAL_MS, KAFKA_REQUEST_TIMEOUT_MS) instead of hardcoded defaults, enabling environment-specific tuning.


91-92: LGTM! Centralized topic subscriptions.

Using CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.WEBSOCKET_GATEWAY] aligns with the PR's goal of centralizing consumer group→topic mappings and ensures the SSE bridge subscribes to the same topics as the WebSocket gateway.


119-126: LGTM! DomainEvent type migration.

The route_event handler now correctly accepts DomainEvent, consistent with the unified event model. The model_dump() call and execution_id extraction remain functionally equivalent.

backend/app/services/saga/saga_orchestrator.py (3)

122-126: LGTM! Kafka timeout settings centralized.

The consumer configuration now correctly sources timeout values from settings, providing consistency with other consumers in the system.


155-183: LGTM! DomainEvent type migration in event handlers.

The _handle_event, _should_trigger_saga methods now correctly use DomainEvent type, aligning with the unified event model. The logic remains unchanged.


185-239: LGTM! Saga trigger and execution methods updated.

The _start_saga and _execute_saga methods now accept DomainEvent as the trigger event type. The getattr(trigger_event, "execution_id", None) pattern safely handles dynamic attribute access on the union type.

backend/app/services/coordinator/coordinator.py (2)

115-118: LGTM! Kafka timeout settings centralized.

The consumer configuration now correctly uses settings-based timeout values, consistent with other consumers in the codebase.


161-161: LGTM! Centralized topic subscriptions.

Using CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR] correctly centralizes topic management.

backend/app/services/k8s_worker/worker.py (2)

111-114: LGTM! Kafka timeout settings centralized.

The consumer configuration now correctly uses settings-based timeout values, consistent with other consumers.


142-143: LGTM! Centralized topic subscriptions.

Using CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.K8S_WORKER] aligns with the centralized subscription model.

backend/app/events/consumer_group_monitor.py (2)

150-150: LGTM!

Storing the settings reference enables consistent timeout configuration for the lag-monitor consumer created in _get_consumer_group_lag.


382-390: LGTM!

The timeout configuration is now consistent with the centralized settings. For this ephemeral lag-monitoring consumer, the request_timeout_ms is the most relevant setting for controlling how long offset fetch operations wait.

backend/app/events/core/consumer.py (4)

15-15: LGTM!

Import correctly updated to use DomainEvent from the unified domain events module as part of the migration.


41-41: LGTM!

Type annotation correctly updated to DomainEvent, consistent with the deserialized event type used when invoking the callback (line 192).


194-195: LGTM!

The callback signature is correctly updated to use DomainEvent, consistent with the field type annotation on line 41.


60-60: LGTM!

Adding request_timeout_ms provides explicit control over broker response timeouts, complementing the existing session/heartbeat/poll interval settings. The field is properly defined in ConsumerConfig with a 40-second default for production use.


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="docs/testing/kafka-test-stability.md">

<violation number="1" location="docs/testing/kafka-test-stability.md:28">
P2: Broken MkDocs snippet directive. The `--8<--` syntax is used by pymdownx.snippets to include external code. Adding spaces (`--8 < --`) breaks this directive and the code snippet will not be rendered.</violation>
</file>



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @docs/testing/kafka-test-stability.md:
- Line 28: The snippet inclusion marker currently has spaces around the '<'
operator (`--8 < --`); remove the spaces so it uses the standard marker `--8<--
"..."` (i.e., change `--8 < --` to `--8<-- "..."`) so the snippet will be
correctly included in the rendered documentation.
🧹 Nitpick comments (2)
docs/reference/environment-variables.md (1)

58-61: Consider documenting timeout relationships.

While the defaults are correct, consider adding a brief note explaining the required relationships between these Kafka timeout values for users who need to customize them. For example:

  • heartbeat_interval_ms should typically be ≤ 1/3 of session_timeout_ms
  • request_timeout_ms should be > session_timeout_ms
  • max_poll_interval_ms should be > session_timeout_ms
📝 Optional: Add a note about timeout relationships

You could add a note below the Kafka Configuration table:

## Kafka Configuration

| Variable                      | Default                       | Description                            |
|-------------------------------|-------------------------------|----------------------------------------|
| ...                           | ...                           | ...                                    |

> **Note:** When customizing Kafka timeouts, ensure: `heartbeat_interval_ms` ≤ `session_timeout_ms` / 3, 
> `request_timeout_ms` > `session_timeout_ms`, and `max_poll_interval_ms` > `session_timeout_ms`.
docs/testing/kafka-test-stability.md (1)

85-87: Fix line break within sentence.

The word "or" is isolated on line 86, breaking sentence flow.

📝 Proposed fix
-The `request_timeout_ms` parameter in aiokafka defaults to **40000ms** (40 seconds). When the Kafka coordinator is slow
-or
-unresponsive during test teardown, the consumer waits the full timeout before giving up.
+The `request_timeout_ms` parameter in aiokafka defaults to **40000ms** (40 seconds). When the Kafka coordinator is slow or unresponsive during test teardown, the consumer waits the full timeout before giving up.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cce76f and 61cf497.

📒 Files selected for processing (2)
  • docs/reference/environment-variables.md
  • docs/testing/kafka-test-stability.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/testing/kafka-test-stability.md

25-25: Spaces inside link text

(MD039, no-space-in-links)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (3)
docs/reference/environment-variables.md (1)

46-62: Documentation updates verified—all Kafka defaults match code.

All documented Kafka configuration defaults (KAFKA_BOOTSTRAP_SERVERS, SCHEMA_REGISTRY_URL, SCHEMA_REGISTRY_AUTH, ENABLE_EVENT_STREAMING, EVENT_RETENTION_DAYS, KAFKA_TOPIC_PREFIX, KAFKA_GROUP_SUFFIX, KAFKA_CONSUMER_GROUP_ID, KAFKA_AUTO_OFFSET_RESET, KAFKA_ENABLE_AUTO_COMMIT, KAFKA_SESSION_TIMEOUT_MS, KAFKA_HEARTBEAT_INTERVAL_MS, KAFKA_MAX_POLL_INTERVAL_MS, KAFKA_REQUEST_TIMEOUT_MS, KAFKA_MAX_POLL_RECORDS) match their corresponding values in backend/app/settings.py. The three new timeout variables are properly defined in code, follow Kafka best practices (heartbeat < session < request), and the explicit "(ms)" unit annotations improve clarity.

docs/testing/kafka-test-stability.md (2)

71-130: Excellent documentation addition!

The new "Consumer teardown delays" section is comprehensive and well-structured. It provides:

  • Clear problem statement with concrete error examples
  • Root cause explanation with external issue reference
  • Practical configuration fix
  • Before/after metrics showing dramatic improvement (40s → <1s)
  • Detailed timeout explanations table

This will be very helpful for developers encountering similar issues.


99-102: Kafka heartbeat interval at configuration boundary.

The heartbeat interval (2000ms) equals exactly session_timeout_ms / 3 (6000 / 3), which is the outer edge of the recommended range (heartbeat.interval.ms ≲ 1/3 of session.timeout.ms). This leaves no margin if a heartbeat is missed before a rebalance triggers. Adjust to KAFKA_HEARTBEAT_INTERVAL_MS=1800 or KAFKA_SESSION_TIMEOUT_MS=6500 to add a safety margin.

@sonarqubecloud

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 11 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/events/core/producer.py">

<violation number="1" location="backend/app/events/core/producer.py:83">
P3: The `client_id` format string is duplicated. Consider storing it as an instance variable during `_on_start()` or as a property to ensure consistency between the actual producer configuration and what `get_status()` reports.</violation>
</file>


"batch_size": self._config.batch_size,
"compression_type": self._config.compression_type,
"bootstrap_servers": self._settings.KAFKA_BOOTSTRAP_SERVERS,
"client_id": f"{self._settings.SERVICE_NAME}-producer",


P3: The client_id format string is duplicated. Consider storing it as an instance variable during _on_start() or as a property to ensure consistency between the actual producer configuration and what get_status() reports.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/events/core/producer.py, line 83:

<comment>The `client_id` format string is duplicated. Consider storing it as an instance variable during `_on_start()` or as a property to ensure consistency between the actual producer configuration and what `get_status()` reports.</comment>

<file context>
@@ -62,28 +61,26 @@ async def _on_start(self) -> None:
-                "batch_size": self._config.batch_size,
-                "compression_type": self._config.compression_type,
+                "bootstrap_servers": self._settings.KAFKA_BOOTSTRAP_SERVERS,
+                "client_id": f"{self._settings.SERVICE_NAME}-producer",
             },
             "metrics": {
</file context>
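
One way to implement the suggestion, as a sketch (only the pieces relevant to the comment; the real UnifiedProducer has more state):

```python
from functools import cached_property

class UnifiedProducer:  # trimmed sketch, not the real class
    def __init__(self, settings) -> None:
        self._settings = settings

    @cached_property
    def client_id(self) -> str:
        # Single source of truth used by both the Kafka client config and get_status().
        return f"{self._settings.SERVICE_NAME}-producer"

    def get_status(self) -> dict[str, str]:
        return {
            "bootstrap_servers": self._settings.KAFKA_BOOTSTRAP_SERVERS,
            "client_id": self.client_id,
        }
```
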
